WordNet wordlist

This is a wordlist based on the lemmas in WordNet. It produces a list of words much less esoteric than the Google Ngram list described below.

Run wordnet.py to create the WordNet wordlist.

Creating the Google word list

Download the 1-gram corpus from Google Books Ngram with:

for a in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    wget http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-$a.gz;
done
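
For readers who prefer to script the download in Python, the loop above could be sketched as follows. This is only an illustration using the standard library; the names `ngram_files` and `download_all` are hypothetical and not part of this repository.

```python
import string
import urllib.request

# Same base URL and file naming as the wget loop above.
BASE_URL = "http://storage.googleapis.com/books/ngrams/books/"
FILE_TEMPLATE = "googlebooks-eng-all-1gram-20120701-{letter}.gz"


def ngram_files():
    """Return the 26 per-letter file names, a through z (hypothetical helper)."""
    return [FILE_TEMPLATE.format(letter=c) for c in string.ascii_lowercase]


def download_all(dest_dir="."):
    """Fetch every per-letter file into dest_dir (needs network access)."""
    for name in ngram_files():
        urllib.request.urlretrieve(BASE_URL + name, f"{dest_dir}/{name}")
```

Calling `download_all()` would fetch all 26 files into the current directory.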

Then filter the words like this (gzcat is the macOS name; on most Linux systems use zcat or gzip -dc instead):

for L in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    gzcat googlebooks-eng-all-1gram-20120701-$L.gz | python ngram-filter.py > googlebooks-eng-all-1gram-20120701-$L-filtered;
done

To get a list of the top 300 words:

sort -n googlebooks-eng-all-1gram-20120701-*-filtered | tail -n 300
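
The same selection can be sketched in Python. This assumes each line of the filtered files starts with an integer count followed by whitespace, which the use of `sort -n` suggests but this README does not state. The helper names `leading_number` and `top_lines` are hypothetical.

```python
import heapq


def leading_number(line):
    """Parse the integer at the start of a line; lines without one count as 0."""
    fields = line.split(None, 1)
    try:
        return int(fields[0])
    except (IndexError, ValueError):
        # Assumption: treat non-numeric lines as 0, as a numeric sort would.
        return 0


def top_lines(lines, n=300):
    """Return the n lines with the largest leading number, smallest first."""
    return heapq.nlargest(n, lines, key=leading_number)[::-1]
```

Unlike the shell pipeline, this keeps only n lines in memory at a time rather than sorting the whole input. The order of lines with equal counts may differ from what `sort -n` produces.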

To create the wordlist used by These3Words, run:

sort -n googlebooks-eng-all-1gram-20120701-*-filtered | python normalise-words.py | sort | uniq | tail -n32768 > google-ngram-list

Check that your list is long enough by counting the lines in google-ngram-list (for example with wc -l google-ngram-list). You need exactly 32768 words, which is 2^15.
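
A small Python check could confirm both the line count and that no word is repeated. This is a minimal sketch; `check_wordlist` is a hypothetical helper, not a script shipped in this directory.

```python
def check_wordlist(path, expected=32768):
    """Return (line_count, unique_count, ok) for the wordlist at path."""
    with open(path, encoding="utf-8") as handle:
        words = [line.rstrip("\n") for line in handle]
    unique = len(set(words))
    # ok only when the list has exactly the expected number of distinct words.
    return len(words), unique, len(words) == expected and unique == expected
```

For example, `check_wordlist("google-ngram-list")` returns the two counts and a boolean that is true only when the list holds exactly 32768 distinct lines.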