WordNet wordlist
This is a wordlist based on the lemmas in WordNet. It produces a list of words that is much less esoteric than the Google Ngram list described below.
Run wordnet.py to create the WordNet wordlist.
Creating the Google word list
Download the 1-gram corpus from Google Ngram with:
for a in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
wget http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-$a.gz;
done
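For reference, the same set of download URLs can be built in Python. This is only a sketch of what the loop above does: the function names `ngram_urls` and `download_all` are illustrative and not part of this repository, and the download uses the standard library's `urllib.request.urlretrieve` in place of wget.

```python
import os
import string
import urllib.request

# Base URL and file name pattern copied from the wget loop above.
BASE = "http://storage.googleapis.com/books/ngrams/books/"
PATTERN = "googlebooks-eng-all-1gram-20120701-{letter}.gz"


def ngram_urls():
    """Return one URL per letter a..z, in the same order as the shell loop."""
    return [BASE + PATTERN.format(letter=letter)
            for letter in string.ascii_lowercase]


def download_all(target_dir="."):
    """Fetch every file into target_dir (hypothetical helper; files are large)."""
    for url in ngram_urls():
        filename = os.path.join(target_dir, url.rsplit("/", 1)[-1])
        urllib.request.urlretrieve(url, filename)
```

Calling `download_all()` fetches all 26 files into the current directory; `ngram_urls()` on its own is useful for checking the list before downloading anything.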
Then filter the words like this:
for L in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
gzcat googlebooks-eng-all-1gram-20120701-$L.gz | python ngram-filter.py > googlebooks-eng-all-1gram-20120701-$L-filtered;
done
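gzcat is not installed under that name on every system (many Linux systems provide zcat or gzip -dc instead). Where none of these is convenient, the decompression step can be done with Python's standard gzip module. The sketch below is not part of this repository; `iter_gzip_lines` and `write_lines` are hypothetical helper names, and UTF-8 is assumed as the text encoding.

```python
import gzip


def iter_gzip_lines(path):
    """Yield decoded text lines from a gzip-compressed file, one at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line


def write_lines(path, out):
    """Stream the decompressed content of path to an open text stream."""
    for line in iter_gzip_lines(path):
        out.write(line)


# Example (assumes the file has already been downloaded):
#   import sys
#   write_lines("googlebooks-eng-all-1gram-20120701-a.gz", sys.stdout)
```

Writing to `sys.stdout` this way lets the output be piped into ngram-filter.py exactly as in the loop above, without holding a whole file in memory.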
To get a list of the top 300 words:
sort -n googlebooks-eng-all-1gram-20120701-*-filtered | tail -n 300
To create the wordlist used by These3Words, run:
sort -n googlebooks-eng-all-1gram-20120701-*-filtered | python normalise-words.py | sort | uniq | tail -n32768 > google-ngram-list
Check that your list is long enough by counting the lines in google-ngram-list. You need exactly 32768 words.
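The line count can also be checked with a few lines of Python. This is a minimal sketch rather than a script shipped with the repository: `count_lines` and `check_wordlist` are illustrative names, and the file is assumed to contain one word per line in UTF-8.

```python
# Number of words required in the final list, as stated above.
EXPECTED_WORDS = 32768


def count_lines(path):
    """Count the lines in a text file without loading it all into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_wordlist(path, expected=EXPECTED_WORDS):
    """Return True when the file has exactly the expected number of lines."""
    return count_lines(path) == expected


# Example:
#   check_wordlist("google-ngram-list")
```

If `check_wordlist` returns False, `count_lines("google-ngram-list")` shows how far off the list is.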