Word Typicality, An Investigation

Overview

We seek here to learn about the most "typical" word of length n for various values of n.  We define the "typical" word as the most commonly suggested spelling correction across all strings of length n drawn from the standard English alphabet, A-Z.

For instance, if we type a word like "typicax", the suggested correction would be "typical".  If we type a nonsense string of letters such as "rtyufghj", we can also check what the suggested spelling correction is: in this case, "retying".

In a sense, the "typical" word of length n is the word that a random string is most drawn towards.  By hypothesis, it represents the true inner nature of words of a given length.

Here are the results followed by some notes on methods.

Note: for words of length 1 and 2, spellcheckers don't really seem to think it is worth correcting us, so results start at 3.

Results

n    Typical          Runners Up
3    JFK              kWh, sqq
4    hajj             TGIF, USCG
5    Kojak            toxic, Kafka
6    Keokuk           zigzag, Koufax
7    skyjack          zigzags, jukebox
8    skyjacks         exegetic, coccyges
9    Reykjavik        cockscomb, skyjacker
10   Jogjakarta       Kafkaesque, toxicology
11   crackerjack      coffeecakes, gynecologic
12   crackerjacks     lexicography, cartographic
13   toxicological    carcinogenics, lexicographer
14   toxicological    lexicographer, lexicographic
15   staphylococcus   geographically, autobiographic


Details on Method

Ideally we would check every combination of n letters for each n, but this quickly becomes infeasible: there are 26^n strings of length n.  Instead we use random sampling: we draw random strings of length n and check them one at a time.  We monitor the top positions over time, wait for the ordering of the list to stop changing, and then just call it good after a while.
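To put a number on "infeasible" and to show what drawing a random string might look like, here is a minimal Python sketch.  The names ALPHABET, search_space_size and random_string are mine, not from the original run, and I am assuming lowercase letters drawn uniformly and independently, as in the examples above.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumption: lowercase a-z, as in the examples


def search_space_size(n: int) -> int:
    """Number of distinct strings of length n over the 26-letter alphabet."""
    return len(ALPHABET) ** n


def random_string(n: int, rng: random.Random) -> str:
    """Draw one string of length n, each letter chosen uniformly and independently."""
    return "".join(rng.choice(ALPHABET) for _ in range(n))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so that reruns are repeatable
    for n in (3, 8, 15):
        # Print the size of the full search space next to one sample draw.
        print(n, search_space_size(n), random_string(n, rng))
```

At n = 3 there are only 17576 strings, so exhaustive checking would be fine there; by n = 15 the count is on the order of 10^21, which is why sampling is used instead.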

"aspell" (a Linux command-line tool) was used as the spelling correction oracle.  If the input is already a word, it returns nothing.  If the random string is "too crazy", it also returns nothing.  If it deems the input word-like enough, it returns a list of one or more candidate corrections, ordered from most to least likely (at least this is my assumption; why would it be otherwise?).

The top correction is kept, and the final result is a dictionary mapping each spelling correction to a count of how often it was suggested.  The word with the highest count is deemed the "typical" word for a given string length.
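The oracle and the tally can be sketched roughly as follows.  This is one possible way to do it, not a record of the exact invocation used: I am assuming aspell's pipe mode ("aspell -a"), an installed English dictionary ("--lang=en"), and that a result line beginning with "&" lists suggestions after a colon.  The function names parse_pipe_line, top_correction and tally are made up for illustration.

```python
import subprocess
from collections import Counter
from typing import Callable, Iterable, List, Optional


def parse_pipe_line(line: str) -> List[str]:
    """Return the suggestions on one result line of aspell's pipe mode.

    Assumed format: '& <word> <count> <offset>: s1, s2, ...' carries
    suggestions; '*' (known word) and '# <word> <offset>' (no
    suggestions) both yield an empty list.
    """
    if not line.startswith("&"):
        return []
    _, _, tail = line.partition(": ")
    return [s.strip() for s in tail.split(",") if s.strip()]


def top_correction(word: str) -> Optional[str]:
    """Ask aspell for its first suggestion, or None if it offers none."""
    proc = subprocess.run(
        ["aspell", "-a", "--lang=en"],  # assumption: pipe mode, English dictionary
        input=word + "\n", capture_output=True, text=True, check=True,
    )
    for line in proc.stdout.splitlines():
        if not line or line.startswith("@"):  # skip blank lines and the version banner
            continue
        suggestions = parse_pipe_line(line)
        return suggestions[0] if suggestions else None
    return None


def tally(strings: Iterable[str],
          correct: Callable[[str], Optional[str]]) -> Counter:
    """Count how often each top correction appears across the sampled strings."""
    counts: Counter = Counter()
    for s in strings:
        top = correct(s)
        if top is not None:  # real words and "too crazy" strings contribute nothing
            counts[top] += 1
    return counts
```

Calling tally(samples, top_correction) and then counts.most_common(3) would give the "typical" word and two runners-up for that batch of samples.  Passing the oracle in as a parameter also makes it easy to swap aspell for another tool.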

NOTES:

  • Sometimes a spelling correction suggestion will contain punctuation (e.g. an apostrophe).  These were excluded since they deviate from the platonic form of an n-letter word.
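A check like the following would be enough to spot such suggestions.  It is a sketch under my own assumptions: the name is_plain_word is hypothetical, and I am treating "letters only" as ASCII A-Z in either case.  Whether a rejected suggestion should be skipped in favour of the next candidate or the whole sample dropped is left open here.

```python
def is_plain_word(suggestion: str) -> bool:
    """True if the suggestion consists only of ASCII letters (no apostrophes, hyphens, spaces)."""
    return suggestion.isascii() and suggestion.isalpha()


if __name__ == "__main__":
    for candidate in ("typical", "Reykjavik", "don't", "co-op"):
        print(candidate, is_plain_word(candidate))
```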

Directions for further research

  • Try various spelling tools (something besides aspell)
  • Use various other dictionaries:  slang, scrabble
  • Reproduce for other languages besides English
  • Allow characters other than letters
  • Proper nouns, should these be excluded?
  • Words are sometimes corrected to longer/shorter variants.  Here I've ignored this effect.  We might get different values if we tracked these.
  • How do we know we've run enough random samples?  Current methodology is to go until it seems like enough.  Perhaps this could be made more precise.
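On that last point, one way to make "until it seems like enough" mechanical is to sample in batches and stop once the top few positions have kept the same order for several batches in a row.  The sketch below is not what was done for the results above; the function name and the parameters k, batch_size, patience and max_batches are all hypothetical, and draw and correct stand in for whatever string generator and spelling oracle are in use.

```python
from collections import Counter
from typing import Callable, List, Optional


def sample_until_stable(
    draw: Callable[[], str],
    correct: Callable[[str], Optional[str]],
    k: int = 3,
    batch_size: int = 1000,
    patience: int = 5,
    max_batches: int = 1000,
) -> Counter:
    """Sample in batches until the top-k ordering is unchanged for `patience` batches."""
    counts: Counter = Counter()
    previous: List[str] = []
    unchanged = 0
    for _ in range(max_batches):
        for _ in range(batch_size):
            top = correct(draw())
            if top is not None:  # strings with no suggestion are not counted
                counts[top] += 1
        current = [word for word, _ in counts.most_common(k)]
        if current and current == previous:
            unchanged += 1  # same ordering as after the previous batch
        else:
            unchanged = 0   # ordering moved (or nothing counted yet); start over
        previous = current
        if unchanged >= patience:
            break
    return counts
```

A stable ordering is still only a heuristic: two words with nearly equal counts could swap places later, so larger values of patience and batch_size buy more confidence at the cost of more oracle calls.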

Conclusions

Lots of j's and k's and foreign words.  Seems like English would rather be another language.
Maybe we should give some of these less common letters a chance to shine.

Acknowledgements

I thank no one.  I am a lone genius.
