
# Wordnet wordlist

This is a wordlist based on the lemmas in WordNet. It produces a list of words much less esoteric than the Google ngram list described below.

Run `wordnet.py` to create the WordNet wordlist.
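The script needs a local copy of WordNet. Assuming it reads the database through NLTK (an assumption; check the script's imports), the corpus can be fetched first:

```
# Assumption: wordnet.py loads WordNet via NLTK; adjust if the
# script reads the database some other way.
pip install nltk
python -c "import nltk; nltk.download('wordnet')"
python wordnet.py
```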

# Creating the Google word list

Download the corpus from the Google Books Ngram dataset with:

```
for a in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    wget http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-$a.gz;
done
```
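All 26 archives should be present before filtering; a quick sanity check:

```
# Expect 26 matches, one archive per letter of the alphabet.
ls googlebooks-eng-all-1gram-20120701-?.gz | wc -l
```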

Filter out unpopular words, words that are not between four and seven characters long, words containing punctuation or digits, and so on, like this:

```
# On Linux, use zcat (or gunzip -c) in place of gzcat.
for L in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    gzcat googlebooks-eng-all-1gram-20120701-$L.gz | python ngram-filter.py > googlebooks-eng-all-1gram-20120701-$L-filtered;
done
```
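For a feel of what the filter keeps, here is a rough shell equivalent. This is a sketch only: it assumes the standard 1-gram column layout (word, year, match_count, volume_count) and that the filtered files hold "count word" pairs (the later pipeline sorts numerically and prints field 2, which suggests that format); the real `ngram-filter.py` may apply additional rules.

```
# Sketch: keep 4-7 letter lowercase words, total their counts across
# years, and print "count word" pairs.
gzcat googlebooks-eng-all-1gram-20120701-a.gz |
    awk 'length($1) >= 4 && length($1) <= 7 && $1 ~ /^[a-z]+$/ { n[$1] += $3 }
         END { for (w in n) print n[w], w }'
```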

To see the 300 most frequent words (each filtered line starts with a count, so a numeric sort puts the most frequent words last):

```
sort -n googlebooks-eng-all-1gram-20120701-*-filtered | tail -n 300
```

The final step in creating a wordlist usable by These3Words is to run:

```
sort -n googlebooks-eng-all-1gram-20120701-*-filtered | python normalise-words.py | sort -k 2 | uniq -f 1 | sort -n | tail -n32768 | awk '{print $2}' > google-ngram-list
```
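Stage by stage, the same pipeline reads as follows (the comments describe the standard tools; what `normalise-words.py` does exactly is defined by the script itself):

```
sort -n googlebooks-eng-all-1gram-20120701-*-filtered |  # merge all letters, least frequent first
    python normalise-words.py |     # normalise the words (rules live in the script)
    sort -k 2 |                     # sort by the word field to group duplicates
    uniq -f 1 |                     # drop duplicate words, ignoring the count field
    sort -n |                       # back to frequency order
    tail -n 32768 |                 # keep the 32768 most frequent words
    awk '{print $2}' > google-ngram-list   # strip counts, one word per line
```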

Check that your list is long enough by counting the lines in google-ngram-list; you need exactly 32768 words (32768 = 2^15, so three words encode 45 bits).
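For example:

```
# Must print exactly 32768.
wc -l < google-ngram-list
```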