Word Typicality, An Investigation

Overview

We want to find the most "typical" word of length n for various values of n. We define "typical" to mean the most commonly suggested spelling correction across all strings of length n drawn from the standard English alphabet, A-Z.

For instance, if we type a word like "typicax", the suggested correction would be "typical". If we type a nonsense string of letters such as "rtyufghj", we can also check what the suggested spelling correction is, in this case "retying".

In a sense, the "typical" word of length n is the word that a random string is most drawn towards. By hypothesis, it represents the true inner nature of words of a given length.

Here are the results, followed by some notes on methods.

Note: for strings of length 1 and 2, spellcheckers don't really seem to think it is worth correcting us, so results start at 3.

Results

n	Typical	Runners Up
3	JFK	kWh, sqq
4	hajj	TGIF, USCG
5	Kojak	toxic, Kafka
6	Keokuk	zigzag, Koufax
7	skyjack	zigzags, jukebox
8	skyjacks	exegetic, coccyges
9	Reykjavik	cockscomb, skyjacker
10	Jogjakarta	Kafkaesque, toxicology
11	crackerjack	coffeecakes, gynecologic
12	crackerjacks	lexicography, cartographic
13	toxicological	carcinogenics, lexicographer
14	toxicological	lexicographer, lexicographic
15	staphylococcus	geographically, autobiographic

Details on Method

Ideally we would check every combination of n letters for each n, but there are 26^n of them and this quickly becomes infeasible. Instead we draw random strings of length n and check those, over and over. We monitor the top positions over time, wait for the ordering to stay stable, and then call it good after a while.
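To make the scale concrete, here is a minimal Python sketch of the sampling step. The post doesn't say which language or letter case was actually used, so the lowercase alphabet and the names search_space_size and random_string are assumptions made for illustration.

```python
import random
import string

# Assumption: lowercase a-z. The post only says "A-Z" and does not state the case used.
ALPHABET = string.ascii_lowercase


def search_space_size(n: int) -> int:
    """Number of distinct strings of length n over the 26-letter alphabet."""
    return len(ALPHABET) ** n


def random_string(n: int, rng: random.Random) -> str:
    """Draw one string of length n, each letter chosen uniformly and independently."""
    return "".join(rng.choices(ALPHABET, k=n))
```

For n = 3 there are only 17,576 strings, so exhaustive checking would be possible there; by n = 10 the count is 141,167,095,653,376, which is why sampling takes over.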

"aspell" (a Linux command-line tool) was used as the spelling correction oracle. If the input is already a word, it returns nothing. If the random string is "too crazy", it also returns nothing. If it deems the input word-like enough, it returns a list of one or more candidate corrections, ordered from most to least likely (at least this is my assumption; why would it be otherwise?).
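One way to query aspell from a script is its Ispell-compatible pipe mode (aspell -a). The sketch below is an assumption about how that could be wired up, not a record of how the experiment was actually run. The function names and the --lang=en setting are illustrative choices. In pipe mode a result line starting with "*" means the word was accepted, "#" means no suggestions, and "&" is followed by the suggestion list.

```python
import subprocess
from typing import List


def parse_pipe_line(line: str) -> List[str]:
    """Parse one result line from aspell's pipe mode.

    "& word count offset: s1, s2, ..." carries suggestions; any other
    line ("*" for a correct word, "# word offset" for no suggestions)
    yields an empty list.
    """
    if not line.startswith("&"):
        return []
    _, _, tail = line.partition(": ")
    return [s.strip() for s in tail.split(",") if s.strip()]


def suggestions(word: str) -> List[str]:
    """Ask a locally installed aspell for corrections of `word` (may be empty)."""
    proc = subprocess.run(
        ["aspell", "-a", "--lang=en"],  # assumption: English dictionary installed
        input="^" + word + "\n",        # "^" forces the line to be treated as text
        capture_output=True,
        text=True,
        check=True,
    )
    # The first output line is a version banner; the next non-empty line is the result.
    for line in proc.stdout.splitlines()[1:]:
        if line:
            return parse_pipe_line(line)
    return []
```

Starting one aspell process per string is slow. A longer run would probably keep a single process open and feed it many lines, but that is left out to keep the sketch short.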

Only the top correction is kept, and the final result is a dictionary mapping each spelling correction to a count of how often it was suggested. The correction with the highest count is deemed the "typical" word for a given string length.
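The tallying step could look something like this in Python, using collections.Counter as the dictionary of counts. The oracle is passed in as a function so the sketch does not depend on any particular spellchecker. The name tally_top_corrections and its parameters are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List


def tally_top_corrections(
    samples: Iterable[str],
    suggest: Callable[[str], List[str]],
) -> Counter:
    """Count how often each word appears as the top suggestion.

    `suggest` is any function returning candidate corrections, best first,
    or an empty list when the oracle has nothing to offer.
    """
    counts: Counter = Counter()
    for s in samples:
        candidates = suggest(s)
        if candidates:  # real words and "too crazy" strings contribute nothing
            counts[candidates[0]] += 1
    return counts
```

The "typical" word and its runners up are then just counts.most_common(3).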

NOTES:

Sometimes a spelling correction suggestion will contain punctuation (e.g. an apostrophe). These were excluded since they deviate from the platonic form of an n-letter word.
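A filter along these lines would do the job. This is only a sketch: the post doesn't say whether a punctuated suggestion was skipped in favor of the next candidate or whether the whole sample was dropped, so the predicate below leaves that choice to the caller. The name is_plain_word is made up for illustration.

```python
def is_plain_word(candidate: str) -> bool:
    """True only if the candidate is non-empty and made purely of ASCII letters.

    Rejects apostrophes, hyphens, spaces, digits, and accented letters,
    none of which belong to the A-Z alphabet used for the input strings.
    """
    return candidate.isascii() and candidate.isalpha()
```

Applied to a suggestion list, [w for w in candidates if is_plain_word(w)] keeps only the letter-only candidates in their original order.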

Directions for further research

Try various spelling tools (something besides aspell)
Use various other dictionaries: slang, scrabble
Reproduce for other languages besides English
Allow characters other than letters
Proper nouns: should these be excluded?
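Words are sometimes corrected to longer/shorter variants. Here I've ignored this effect, so n is the length of the input string and not necessarily of the correction (presumably this is why the 13-letter "toxicological" tops the n = 14 row). We might get different values if we tracked these.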
How do we know we've run enough random samples? The current methodology is to go until it seems like enough. Perhaps this could be made more precise.
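On that last point, one possible way to firm up "until it seems like enough" is to sample in batches and stop once the top few positions have kept the same order for several batches in a row. The sketch below is a heuristic and not a statistical guarantee, and it is not what was done for the results above. Every name and default value in it (run_until_stable, top_k, batch_size, stable_batches, max_batches) is an assumption chosen for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def run_until_stable(
    draw_correction: Callable[[], Optional[str]],
    top_k: int = 3,
    batch_size: int = 1000,
    stable_batches: int = 5,
    max_batches: int = 10_000,
) -> Counter:
    """Sample until the top_k ranking is unchanged for stable_batches batches.

    `draw_correction` draws one random string, asks the oracle, and returns
    the top correction, or None when there is no usable suggestion.
    """
    counts: Counter = Counter()
    previous: Tuple[str, ...] = ()
    unchanged = 0
    for _ in range(max_batches):
        for _ in range(batch_size):
            word = draw_correction()
            if word is not None:
                counts[word] += 1
        # Ranking after this batch, compared with the ranking after the last one.
        current = tuple(w for w, _ in counts.most_common(top_k))
        unchanged = unchanged + 1 if current and current == previous else 0
        previous = current
        if unchanged >= stable_batches:
            break
    return counts
```

A more careful version might compare the gap between first and second place against the sampling noise, but even this simple rule replaces "a while" with numbers that can be reported.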

Conclusions

Lots of j's and k's and foreign words. Seems like English would rather be another language.
Maybe we should give some of these less common letters a chance to shine.

Acknowledgements

I thank no one. I am a lone genius.

MadeOfMistake
