I’ve been doing a lot of crosswords lately, so I thought I’d spend some time playing around with old crosswords to see if there was anything interesting to say about them. This whole post is structured as a series of questions I or other people asked, which I then answered using data from this repository. If there’s anything not covered here you’d like to know about, feel free to get in touch with me!
Just for reference, there are 1,229,405 clue/answer pairs, taken from 14,524 puzzles. After throwing out duplicates, there are 140,628 distinct answers and 696,050 distinct clues.
1. The Words
Question. What are the most common words?
It turns out that these are almost exclusively short, vowel-rich words, with the top 5 being:
These aren’t especially interesting, so I then filtered by length, i.e. found the most common word of each length. They’re a bit unwieldy to format here, but you can see them all here. The shorter ones are still boring, but the longer ones are cute, and the sudden uptick in frequencies for the final block is especially telling: there just aren’t that many very long words appearing in crosswords, it seems.
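Here is a minimal sketch of both counts in Python. The `answers` list is a made-up sample rather than the repository's data, and this is not necessarily how the numbers above were produced.

```python
from collections import Counter

# Hypothetical sample; the real input would be every answer,
# listed once per appearance in a puzzle.
answers = ["ERA", "AREA", "ERA", "ONE", "AREA", "ERA", "OREO", "ELI", "ONE"]

# Overall frequency of each answer.
counts = Counter(answers)
top_overall = counts.most_common(2)

# Most common answer of each length: keep the highest count seen per length.
best_by_length = {}
for word, n in counts.items():
    current = best_by_length.get(len(word))
    if current is None or n > current[1]:
        best_by_length[len(word)] = (word, n)
```

On the full data set, `counts.most_common(5)` would give the top 5, and `best_by_length` would hold one winner per word length.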
Question. What is the longest word?
This turns out to be a bit tricky to answer: doing the obvious sorting gives the answer of:
which is from the 12/30/2001 puzzle, which I couldn’t find a copy of. The second longest is the theme answer from this puzzle, which suggests that the previous one was a rebus too.
If you don’t count rebus answers (which is reasonable), then there’s a whole slew of 23-letter words coming from a time when the Sunday was 23 by 23. If you don’t want to count those either, then you’re left with the pile of 21-letter words that span modern Sunday grids.
Question. What word has the most distinct clues?
The answer here turns out to be, for the most part, the same as the common words:
Just to take a peek under the hood, all of the distinct clues for ERA can be seen here, and they seem to fall into three categories:
- ERA like the time period
- ERA like the baseball statistic
- ERA like the Equal Rights Amendment
So, on a spiritual level, there are only 3 clues for ERA. On the other hand, ONE actually has many meaningfully different clues, all of which are listed here. I wish there were some way to capture this sense of “spiritually different clues” and then find the answer with the most “spiritually diverse” clues, but I can’t think of a good algorithmic way to do it.
Question. What words have very few distinct clues?
I arbitrarily chose to look at words that are among the 10,000 most common words but have fewer than 5 distinct clues. The 5 most common (although there are only 9 in total) are:
The occurrences column basically shows that for a word to have fewer than 5 distinct clues, it really can’t appear all that often. If it does, it ends up having at least technically diverse clues, like those for ERA.
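The distinct-clue counts and the "common but rarely re-clued" filter can be sketched as follows. The `pairs` list and the helper name `rarely_reclued` are made up for illustration; with the real data the cutoffs from the text would be `top_n=10000` and `max_clues=5`.

```python
from collections import Counter, defaultdict

# Hypothetical (clue, answer) pairs standing in for the real data set.
pairs = [
    ("Historical period", "ERA"),
    ("Pitcher's stat", "ERA"),
    ("Historical period", "ERA"),
    ("Jai ___", "ALAI"),
    ("Jai ___", "ALAI"),
    ("Single", "ONE"),
]

# How often each answer appears, and which distinct clues point to it.
occurrences = Counter(answer for _, answer in pairs)
clues_for = defaultdict(set)
for clue, answer in pairs:
    clues_for[answer].add(clue)

def rarely_reclued(top_n, max_clues):
    """Answers among the top_n most common with fewer than max_clues distinct clues."""
    common = [answer for answer, _ in occurrences.most_common(top_n)]
    return [
        (a, occurrences[a], len(clues_for[a]))
        for a in common
        if len(clues_for[a]) < max_clues
    ]
```

Sorting `clues_for` by the size of each set would answer the earlier question about which word has the most distinct clues.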
Question. How has the popularity of words changed over time?
I plotted this for 10 common words – there don’t seem to be any appreciable trends.
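One way to measure popularity over time, sketched under the assumption that "popularity" means the share of a year's answers taken up by a given word (the post doesn't say which measure was plotted). The `records` list and the `yearly_share` helper are hypothetical.

```python
from collections import Counter

# Hypothetical (year, answer) records.
records = [
    (1994, "ERA"), (1994, "ONE"),
    (1995, "ERA"), (1995, "ERA"), (1995, "AREA"), (1995, "ONE"),
]

def yearly_share(records, word):
    """Fraction of each year's answers that are equal to `word`."""
    totals = Counter(year for year, _ in records)
    hits = Counter(year for year, answer in records if answer == word)
    return {year: hits[year] / totals[year] for year in sorted(totals)}
```

Using a share rather than a raw count keeps years with more puzzles from looking like spikes.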
Question. How does the distribution of letters in crossword answers compare to in normal English?
One graph shows it all:
Some of these make sense: A and E are over-represented because it’s easy to fill in grids with them. Others make no sense at all: why is H so under-represented?
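A comparison like this could be computed roughly as below. Both inputs are tiny stand-ins: the answer list is made up, and a single pangram is used in place of a real English corpus, so only the mechanics are meaningful.

```python
from collections import Counter
import string

def letter_shares(words):
    """Relative frequency of each letter A-Z across a collection of words."""
    counts = Counter(
        ch for w in words for ch in w.upper() if ch in string.ascii_uppercase
    )
    total = sum(counts.values())
    return {ch: counts[ch] / total for ch in string.ascii_uppercase}

# Hypothetical inputs: a few answers, and a sentence standing in for "normal English".
crossword = letter_shares(["ERA", "AREA", "OREO"])
english = letter_shares("the quick brown fox jumps over the lazy dog".split())

# Positive values mean the letter is over-represented in the crossword sample.
difference = {ch: crossword[ch] - english[ch] for ch in string.ascii_uppercase}
```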
Question. What are the hardest words?
Well, first you have to come up with some way to measure difficulty: I assigned values of 1, 2, 3, 4, 5, 6, 4 to the days of the week in order (Monday through Sunday), in keeping with the conventional wisdom that Sunday is about as hard as Thursday. A word’s difficulty is then the average of these values over all of its appearances. Here are the 5 words from the 10,000 most common words that had the highest average value:
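The day-weighting scheme can be sketched like this. The weights come from the text, with the Monday-first ordering inferred from Sunday matching Thursday; the `records` list and the function name are hypothetical.

```python
from collections import defaultdict

# Weights from the text; Monday-first ordering is an inference, not stated outright.
DAY_VALUE = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 4}

# Hypothetical (weekday, answer) records.
records = [
    ("Mon", "ERA"), ("Tue", "ERA"), ("Sun", "ERA"),
    ("Fri", "ESNE"), ("Sat", "ESNE"),
]

def average_difficulty(records):
    """Mean day value over every appearance of each answer."""
    values = defaultdict(list)
    for day, answer in records:
        values[answer].append(DAY_VALUE[day])
    return {answer: sum(v) / len(v) for answer, v in values.items()}

scores = average_difficulty(records)
```

Restricting `scores` to the most common answers before sorting keeps one-off Saturday words from dominating the ranking.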
2. The Clues
Question. What are the most common clues?
No real pattern to these:
It’s interesting to look at the answers these clues point to: “Concerning” has a handful, “Jai ___” has only one (ALAI), “Greek letter” and “Chemical suffix” both have a decent mix again, and “Gaelic” is only ever a clue for ERSE.
Question. What are clues that have many distinct answers?
I’m not going to give a list of the top results here, because many of them are things like “See 17-Across” or “Theme of this puzzle,” which are not very interesting. Two results that are in the right vein, though, are “Split” and “Nonsense”, which have 47 and 40 possible answers respectively; these are listed here and here.
Question. What are clues that have only one answer?
There are tons of these. A small sample:
Note that I’m omitting “Jai ___” and “Gaelic” for variety’s sake. As suggested by this sample, nearly all of the one-answer clues are fill-in-the-blank clues, which makes sense.
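Both of the last two questions come down to grouping answers by clue. A minimal sketch, using made-up pairs (the answers shown for “Split” are illustrative, not taken from the real list):

```python
from collections import defaultdict

# Hypothetical (clue, answer) pairs.
pairs = [
    ("Split", "LEFT"), ("Split", "RENT"), ("Split", "LEFT"),
    ("Jai ___", "ALAI"), ("Jai ___", "ALAI"),
    ("Gaelic", "ERSE"),
]

# Distinct answers seen for each clue.
answers_for = defaultdict(set)
for clue, answer in pairs:
    answers_for[clue].add(answer)

# Clues ordered from most to fewest distinct answers.
by_variety = sorted(answers_for, key=lambda c: len(answers_for[c]), reverse=True)

# Clues that only ever point to a single answer.
single_answer = sorted(c for c, a in answers_for.items() if len(a) == 1)
```

On the real data, cross-reference clues like “See 17-Across” would need to be filtered out of `by_variety` by hand or with a pattern.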
3. The Days
Question. How does the length of words vary over the week?
Not much, as it turns out:
Question. How does the frequency of puns vary over the week?
A lot, as it turns out:
Question. How does the density of vowels vary over the week?
And we’re back to not much:
So it seems like puns best capture the difficulty gradient of the week.
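All three per-day measures can be computed in one pass. The post doesn't say how puns were detected, so this sketch assumes the usual crossword convention that a clue ending in a question mark signals wordplay. The records and the `by_day` helper are hypothetical.

```python
from collections import defaultdict

VOWELS = set("AEIOU")

# Hypothetical (weekday, clue, answer) records.
records = [
    ("Mon", "Historical period", "ERA"),
    ("Mon", "Sandwich cookie", "OREO"),
    ("Sat", "Bank holding?", "RIVER"),
    ("Sat", "Gaelic", "ERSE"),
]

def by_day(records):
    """Pun rate, vowel density and mean answer length for each weekday."""
    groups = defaultdict(list)
    for day, clue, answer in records:
        groups[day].append((clue, answer))
    stats = {}
    for day, items in groups.items():
        letters = "".join(answer for _, answer in items)
        stats[day] = {
            # Assumption: a trailing "?" marks a punny clue.
            "pun_rate": sum(clue.endswith("?") for clue, _ in items) / len(items),
            "vowel_density": sum(ch in VOWELS for ch in letters) / len(letters),
            "mean_length": len(letters) / len(items),
        }
    return stats

stats = by_day(records)
```

Keying the groups on the puzzle's year instead of its weekday gives the same three measures over time.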
4. The Years
Question. How does the length of words vary over the years?
In strange and inconsistent ways:
One fact that may be relevant: Will Shortz began editing the puzzle in 1993. I’m not sure if this explains the jump, but it’s certainly a compelling story.
Question. How does the frequency of puns vary over the years?
People love them more every year:
It’s interesting to note that there’s still a spike in the mid-90s, but this spike comes 2-3 years after Shortz became editor.
Question. How does the density of vowels vary over the years?
Basically the opposite of the puns:
Again, the huge drop occurs when Shortz takes over.