
# Wordnet wordlist

This is a wordlist based on the lemmas in WordNet. It produces a list of words much less esoteric than the Google ngram list described below.

Run `wordnet.py` to create the WordNet wordlist.
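The script needs a local copy of WordNet. Assuming it reads the database through NLTK (an assumption; check the script's imports), the corpus can be fetched first:

```
# Assumption: wordnet.py loads WordNet via NLTK; adjust if the
# script reads the database some other way.
pip install nltk
python -c "import nltk; nltk.download('wordnet')"
python wordnet.py
```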

# Creating the Google word list

Download the corpus from the Google Books Ngram dataset with:

```
for a in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    wget http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-$a.gz;
done
```
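All 26 archives should be present before filtering; a quick sanity check:

```
# Expect 26 matches, one archive per letter of the alphabet.
ls googlebooks-eng-all-1gram-20120701-?.gz | wc -l
```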

Filter out unpopular words, words that are not between four and seven characters long, words containing punctuation or digits, and so on, like this:

```
# On Linux, use zcat (or gunzip -c) in place of gzcat.
for L in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    gzcat googlebooks-eng-all-1gram-20120701-$L.gz | python ngram-filter.py > googlebooks-eng-all-1gram-20120701-$L-filtered;
done
```
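For a feel of what the filter keeps, here is a rough shell equivalent. This is a sketch only: it assumes the standard 1-gram column layout (word, year, match_count, volume_count) and that the filtered files hold "count word" pairs (the later pipeline sorts numerically and prints field 2, which suggests that format); the real `ngram-filter.py` may apply additional rules.

```
# Sketch: keep 4-7 letter lowercase words, total their counts across
# years, and print "count word" pairs.
gzcat googlebooks-eng-all-1gram-20120701-a.gz |
    awk 'length($1) >= 4 && length($1) <= 7 && $1 ~ /^[a-z]+$/ { n[$1] += $3 }
         END { for (w in n) print n[w], w }'
```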

To see the 300 most frequent words (each filtered line starts with a count, so a numeric sort puts the most frequent words last):

```
sort -n googlebooks-eng-all-1gram-20120701-*-filtered | tail -n 300
```

The final step in creating a wordlist usable by These3Words is to run:

```
sort -n googlebooks-eng-all-1gram-20120701-*-filtered | python normalise-words.py | sort -k 2 | uniq -f 1 | sort -n | tail -n32768 | awk '{print $2}' > google-ngram-list
```
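Stage by stage, the same pipeline reads as follows (the comments describe the standard tools; what `normalise-words.py` does exactly is defined by the script itself):

```
sort -n googlebooks-eng-all-1gram-20120701-*-filtered |  # merge all letters, least frequent first
    python normalise-words.py |     # normalise the words (rules live in the script)
    sort -k 2 |                     # sort by the word field to group duplicates
    uniq -f 1 |                     # drop duplicate words, ignoring the count field
    sort -n |                       # back to frequency order
    tail -n 32768 |                 # keep the 32768 most frequent words
    awk '{print $2}' > google-ngram-list   # strip counts, one word per line
```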

Check that your list is long enough by counting the lines in google-ngram-list; you need exactly 32768 words (32768 = 2^15, so three words encode 45 bits).
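For example:

```
# Must print exactly 32768.
wc -l < google-ngram-list
```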