Q&A with the NYT Crossword Archive

I’ve been doing a lot of crosswords lately, so I thought I’d spend some time playing around with old crosswords to see if there was anything interesting to say about them. This whole post is structured as a series of questions I or other people asked, which I then answered using data from this repository. If there’s anything not covered here you’d like to know about, feel free to get in touch with me!

Just for reference, there are 1229405 clue/answer pairs, taken from 14524 puzzles. After throwing out duplicates, there are 140628 distinct answers, and 696050 distinct clues.

1. The Words

Question. What are the most common words? 
It turns out that these are almost exclusively short, vowel-rich words, with the top 5 being:

Word Occurrences
AREA 894
ERA 880
ERIE 818
ERE 750
ALOE 723
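
For anyone who wants to reproduce a tally like this, here is a minimal sketch using Python's `collections.Counter`. It assumes the archive has been loaded as a list of (clue, answer) tuples; that layout, the `most_common_answers` name, and the tiny `pairs` sample are made up for illustration and are not the repository's actual format.

```python
from collections import Counter

# Hypothetical sample; the real archive's layout is not described in this post.
pairs = [
    ("Region", "AREA"),
    ("Historical period", "ERA"),
    ("Zone", "AREA"),
    ("Great Lake", "ERIE"),
    ("Pitcher's stat", "ERA"),
    ("Square footage", "AREA"),
]

def most_common_answers(pairs, n=5):
    """Return the n most frequent answers with their occurrence counts."""
    return Counter(answer for _clue, answer in pairs).most_common(n)

print(most_common_answers(pairs, n=2))
```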

These aren’t especially interesting, so I then filtered by length, i.e., found the most common word of each length. They’re a bit unwieldy to format here, but you can see them all here. The smaller ones are still boring, but the longer ones are cute, and the sudden uptick in frequencies for the final block is especially helpful: there just aren’t that many very long words appearing in crosswords, it seems.
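
The per-length breakdown can be sketched roughly as follows. This is one possible way to do it, assuming a flat list of answer strings; the `most_common_by_length` helper and the sample list are hypothetical.

```python
from collections import Counter

def most_common_by_length(answers):
    """Map each answer length to its most frequent answer and that answer's count."""
    counts = Counter(answers)
    best = {}
    for word, n in counts.items():
        length = len(word)
        # Keep the first answer seen at each length unless a later one is strictly more frequent.
        if length not in best or n > best[length][1]:
            best[length] = (word, n)
    return dict(sorted(best.items()))

# Made-up sample data.
sample = ["ERA", "ERA", "ORE", "AREA", "ERIE", "AREA", "ALACARTE"]
print(most_common_by_length(sample))
```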

Question. What is the longest word? 
This turns out to be a bit tricky to answer: doing the obvious sorting gives the answer of:

“TENNINEEIGHTSEVENSIXFIVEFOURTHREETWOONEITSTWOOOTWO,”

which is from the 12/30/2001 puzzle, which I couldn’t find a copy of. The second longest is the theme answer from this puzzle, which makes it clear that the previous one was probably a rebus too.

If you don’t count a rebus answer (which is reasonable), then there’s a whole slew of 23-letter words coming from a time when the Sunday was 23 by 23. If you don’t want to count those either, then you’re left with the pile of 21-letter words that span modern Sunday grids.

Question. What word has the most distinct clues?
The answer here turns out to be, for the most part, the same as the common words:

Word Distinct Clues
ERA 462
ERIE 439
ONE 432
ARIA 424
ORE 405
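
A distinct-clue count like this can be sketched by collecting the clues for each answer into a set. The sketch below assumes (clue, answer) tuples and treats two clues as distinct whenever their text differs at all; both the data layout and the `distinct_clue_counts` name are assumptions for illustration.

```python
from collections import defaultdict

def distinct_clue_counts(pairs):
    """Return (answer, number of distinct clues) sorted from most to fewest clues."""
    clues_by_answer = defaultdict(set)
    for clue, answer in pairs:
        clues_by_answer[answer].add(clue)
    counts = ((answer, len(clues)) for answer, clues in clues_by_answer.items())
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(counts, key=lambda item: (-item[1], item[0]))

# Made-up sample data: the repeated clue is only counted once.
pairs = [
    ("Historical period", "ERA"),
    ("Pitcher's stat", "ERA"),
    ("Historical period", "ERA"),
    ("Great Lake", "ERIE"),
]
print(distinct_clue_counts(pairs))
```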

Just to take a peek under the hood, all of the distinct clues for ERA can be seen here, and they seem to fall into three categories:

  • ERA like the time period
  • ERA like the baseball statistic
  • ERA like the Equal Rights Amendment

So, on a spiritual level, there are only 3 clues for ERA. On the other hand, ONE actually has many meaningfully different clues, all of which are listed here. I wish there was some way to capture this sense of “spiritually different clues” and then find the answer with the most “spiritually diverse” clues, but I can’t think of a good algorithmic way to do it.

Question. What words have very few distinct clues?
I arbitrarily chose to look at words that are in the 10000 most common words but have fewer than 5 clues. The 5 most common (although there are only 9 in total) are:

Word Occurrences Clues
ORAD 60 4
AVISO 38 4
BRAC 32 2
HADON 31 4
OEN 29 4

The occurrences column basically shows that for a word to have fewer than 5 distinct clues, it really can’t appear all that often. If it does, it ends up having at least technically diverse clues, like those for ERA.
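
Here is a rough sketch of that filter: take the most common answers, then keep the ones with fewer than some number of distinct clues. The function name, its parameters, and the sample data are hypothetical, and the (clue, answer) tuple layout is an assumption.

```python
from collections import Counter, defaultdict

def rarely_reclued(pairs, top_n=10000, max_clues=5):
    """Among the top_n most common answers, list those with fewer than max_clues distinct clues."""
    occurrences = Counter(answer for _clue, answer in pairs)
    clues = defaultdict(set)
    for clue, answer in pairs:
        clues[answer].add(clue)
    return [
        (answer, count, len(clues[answer]))
        for answer, count in occurrences.most_common(top_n)
        if len(clues[answer]) < max_clues
    ]

# Made-up sample: ORAD always gets the same clue, ERA gets four different ones.
pairs = [("Toward the mouth", "ORAD")] * 3 + [
    ("Period", "ERA"), ("Time", "ERA"), ("Stat", "ERA"), ("Age", "ERA"),
]
print(rarely_reclued(pairs, top_n=2, max_clues=2))
```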

Question. How has the popularity of words changed over time?
I plotted this for 10 common words – there don’t seem to be any appreciable trends.

crosswords_over_time
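
One way to get the numbers behind a plot like this is to compute, for each year, the share of that year's answers equal to a given word. The sketch below assumes records of the form (year, answer); that layout and the `yearly_share` name are illustrative assumptions, and normalizing by the yearly total is my choice rather than something stated above.

```python
from collections import Counter

def yearly_share(records, word):
    """Return {year: fraction of that year's answers that equal `word`}."""
    totals = Counter()
    hits = Counter()
    for year, answer in records:
        totals[year] += 1
        if answer == word:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Made-up sample data.
records = [
    (1994, "ERA"), (1994, "AREA"),
    (1995, "ERA"), (1995, "ORE"), (1995, "ERIE"), (1995, "ALOE"),
]
print(yearly_share(records, "ERA"))
```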

Question. How does the distribution of letters in crossword answers compare to in normal English?
One graph shows it all:

diffs

Some of these make sense: A and E are over-represented because it’s easy to fill in grids with them. Others make no sense at all: why is H so under-represented?
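
The comparison itself is simple to sketch: compute each letter's share of all letters in the answers, do the same for some reference English text, and subtract. Which reference text to use is left to the caller here; the function names and the toy inputs are hypothetical.

```python
from collections import Counter
import string

def letter_shares(text):
    """Return each letter A-Z's share of the alphabetic characters in `text`."""
    letters = [c for c in text.upper() if c in string.ascii_uppercase]
    counts = Counter(letters)
    total = len(letters) or 1  # avoid dividing by zero on empty input
    return {c: counts[c] / total for c in string.ascii_uppercase}

def share_diffs(answers_text, reference_text):
    """Positive values mean the letter is over-represented in the answers."""
    a = letter_shares(answers_text)
    b = letter_shares(reference_text)
    return {c: a[c] - b[c] for c in string.ascii_uppercase}

# Toy inputs, not real data.
diffs = share_diffs("AAAA", "AB")
print(diffs["A"], diffs["B"])
```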

Question. What are the hardest words?
Well, first you have to come up with some way to measure difficulty: I assigned values of 1, 2, 3, 4, 5, 6, 4 to Monday through Sunday in order, keeping with the conventional wisdom that Sunday is about as hard as Thursday. Here are the 5 words from the 10,000 most common words that had the highest average value:

Word Average Difficulty
ONLEAVE 4.97
ALACARTE 4.93
OPALINE 4.85
ETAMINE 4.79
ONEONEONE 4.78
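
As a sketch of the scoring, the snippet below averages the day weights over every appearance of an answer. The weights are the ones given above, read as Monday through Sunday (which is what makes Sunday match Thursday); the (answer, weekday) record layout and the names are assumptions for illustration.

```python
from collections import defaultdict

# Weights from the post, read as Monday through Sunday (assumed ordering).
DAY_WEIGHT = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 4}

def average_difficulty(occurrences):
    """occurrences: iterable of (answer, weekday) pairs -> {answer: mean day weight}."""
    weights = defaultdict(list)
    for answer, day in occurrences:
        weights[answer].append(DAY_WEIGHT[day])
    return {answer: sum(w) / len(w) for answer, w in weights.items()}

# Made-up sample data.
occurrences = [
    ("ONLEAVE", "Sat"), ("ONLEAVE", "Fri"),
    ("AREA", "Mon"), ("AREA", "Sun"),
]
print(average_difficulty(occurrences))
```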

2. The Clues

Question. What are the most common clues?
No real pattern to these:

Clue Occurrences
Concerning 317
Jai ___ 256
Greek letter 250
Chemical suffix 225
Gaelic 216

It’s interesting to look at the answers these clues point to: “Concerning” has a handful, “Jai ___” has only one (ALAI), “Greek letter” and “Chemical suffix” both have a decent mix again, and “Gaelic” is only ever a clue for ERSE.

Question. What are clues that have many distinct answers? 
I’m not going to give a list of the top results here, because many of them are things like “See 17-Across” or “Theme of this puzzle,” which are not very interesting. Two results that are in the right vein, though, are “Split” and “Nonsense,” which have 47 and 40 possible answers respectively, listed out here and here.
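
Counting distinct answers per clue, while skipping cross-reference clues, might look something like the sketch below. The regular expression is my own guess at what a cross-reference looks like ("17-Across", "4-Down"); it would not catch clues like “Theme of this puzzle,” and the names and sample data are hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern for cross-reference clues such as "See 17-Across".
CROSS_REF = re.compile(r"\b\d+-(?:Across|Down)\b", re.IGNORECASE)

def answers_per_clue(pairs):
    """Return (clue, number of distinct answers), most answers first, skipping cross-references."""
    answers = defaultdict(set)
    for clue, answer in pairs:
        if CROSS_REF.search(clue):
            continue
        answers[clue].add(answer)
    counts = ((clue, len(found)) for clue, found in answers.items())
    return sorted(counts, key=lambda item: (-item[1], item[0]))

# Made-up sample data.
pairs = [
    ("Split", "RENT"), ("Split", "TORN"), ("Split", "LEFT"),
    ("See 17-Across", "PIE"), ("See 17-Across", "HAT"),
    ("Nonsense", "ROT"), ("Nonsense", "BOSH"),
]
print(answers_per_clue(pairs))
```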

Question. What are clues that have only one answer? 
There are tons of these. A small sample:

Clue Occurrences
___-majesté 143
___ avis 132
Dies ___ 132
Sri ___ 102
___-Magnon 98

Note that I’m omitting “Jai ___” and “Gaelic” for variety’s sake. As suggested by this sample, nearly all of the one-answer clues are fill in the blanks, which makes sense.
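
Finding one-answer clues can be sketched as below. A clue that appears only once trivially has one answer, so this version requires a minimum number of occurrences; that threshold, the function name, and the sample data are assumptions of mine rather than anything from the analysis above.

```python
from collections import Counter, defaultdict

def single_answer_clues(pairs, min_occurrences=2):
    """Return (clue, occurrences, answer) for repeated clues that always have the same answer."""
    occurrences = Counter(clue for clue, _answer in pairs)
    answers = defaultdict(set)
    for clue, answer in pairs:
        answers[clue].add(answer)
    return [
        (clue, n, next(iter(answers[clue])))
        for clue, n in occurrences.most_common()
        if n >= min_occurrences and len(answers[clue]) == 1
    ]

# Made-up sample data.
pairs = (
    [("Jai ___", "ALAI")] * 3
    + [("Sri ___", "LANKA")] * 2
    + [("Greek letter", "ETA"), ("Greek letter", "RHO")]
)
print(single_answer_clues(pairs))
```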

3. The Days

Question. How does the length of words vary over the week?
Not much, as it turns out:

length_by_day.png

Question. How does the frequency of puns vary over the week?
A lot, as it turns out:

puns_days

Question. How does the density of vowels vary over the week? 
And we’re back to not much:

vowels_days.png

So it seems like puns best capture the difficulty gradient of the week.
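
The three per-day measurements in this section can be sketched together. How a pun is detected is not spelled out above, so this sketch leans on the common crossword convention that a clue ending in a question mark signals wordplay; that heuristic, the (weekday, clue, answer) record layout, and all names are assumptions. Answers are assumed to be non-empty uppercase strings.

```python
from collections import defaultdict

VOWELS = set("AEIOU")

def weekday_stats(records):
    """records: iterable of (weekday, clue, answer) -> per-day length, vowel, and pun stats."""
    grouped = defaultdict(list)
    for day, clue, answer in records:
        grouped[day].append((clue, answer))
    stats = {}
    for day, items in grouped.items():
        letters = sum(len(answer) for _clue, answer in items)
        vowels = sum(ch in VOWELS for _clue, answer in items for ch in answer)
        puns = sum(clue.rstrip().endswith("?") for clue, _answer in items)
        stats[day] = {
            "mean_length": letters / len(items),
            "vowel_density": vowels / letters,
            "pun_share": puns / len(items),  # assumed heuristic: clue ends in "?"
        }
    return stats

# Made-up sample data.
records = [
    ("Mon", "Region", "AREA"), ("Mon", "Time period", "ERA"),
    ("Sat", "Flat rate?", "RENT"), ("Sat", "Long time", "EON"),
]
stats = weekday_stats(records)
print(stats["Sat"])
```

The same grouping works for the yearly plots if the weekday key is swapped for the puzzle's year.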

4. The Years

Question. How does the length of words vary over the years?
In strange and inconsistent ways:

year_length.png

One fact that may be relevant: Will Shortz began editing the puzzle in 1993. I’m not sure if this explains the jump, but it’s certainly a compelling story.

Question. How does the frequency of puns vary over the years?
People love them more every year:

puns_years.png

It’s interesting to note that there’s still a spike in the mid-90s, but this spike comes 2-3 years after Shortz became editor.

Question. How does the density of vowels vary over the years? 
Basically the opposite of the puns:

vowels_years.png

Again, the huge drop occurs when Shortz takes over.
