Creating a word list
Download the 1-gram files of the Google Books Ngram corpus with:

    for a in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
        wget http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-$a.gz;
    done
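
If wget is not to hand, the same download can be sketched in Python. The URL pattern is the one from the loop above; the names ngram_urls and download_all and the dest_dir argument are made up for this example and are not part of this repository.

```python
import string
import urllib.request
from pathlib import Path

# URL pattern copied from the wget loop in this README.
URL_TEMPLATE = (
    "http://storage.googleapis.com/books/ngrams/books/"
    "googlebooks-eng-all-1gram-20120701-{letter}.gz"
)


def ngram_urls():
    """Return the 26 per-letter 1-gram URLs, a through z."""
    return [URL_TEMPLATE.format(letter=letter) for letter in string.ascii_lowercase]


def download_all(dest_dir="."):
    """Download every file into dest_dir (hypothetical helper)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in ngram_urls():
        target = dest / url.rsplit("/", 1)[-1]
        if not target.exists():  # skip files that are already present
            urllib.request.urlretrieve(url, str(target))
```

Calling download_all() fetches all 26 files into the current directory. Note that the existence check also skips a file left behind by an interrupted download, so delete any partial file before retrying.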
Then filter the words like this (if gzcat is not available on your system, zcat or gzip -dc does the same job):

    for L in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
        gzcat googlebooks-eng-all-1gram-20120701-$L.gz | python ngram-filter.py > googlebooks-eng-all-1gram-20120701-$L-filtered;
    done
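
The contents of ngram-filter.py are not reproduced here, so the following is only a sketch of what a filter of this kind could look like, not the actual script. It assumes the tab-separated row layout of the 2012 1-gram files (ngram, year, match_count, volume_count) and assumes the output should be count-first rows so that sort -n can order them. The function names are invented for this example.

```python
from collections import Counter


def total_counts(lines):
    """Sum match_count over all years for each all-letter 1-gram, lowercased."""
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip rows that do not have the assumed four columns
        ngram, _year, match_count, _volume_count = fields
        word = ngram.lower()
        if not word.isalpha():
            # Drops tokens with digits, punctuation or underscores,
            # which includes part-of-speech entries such as "run_VERB".
            continue
        totals[word] += int(match_count)
    return totals


def format_rows(totals):
    """Render 'count<TAB>word' rows (assumed output layout)."""
    return [f"{count}\t{word}" for word, count in totals.items()]
```

To use it as a stdin filter, pass sys.stdin to total_counts and print each row returned by format_rows. Summing over years is a design choice of this sketch; the real script may weight or select years differently.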
To get a list of the top 300 words:

    sort -n googlebooks-eng-all-1gram-20120701-*-filtered | tail -n 300
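
The same top-N selection can be sketched in Python without sorting everything. This assumes each filtered row is a count, a tab, then the word; that layout is a guess based on the use of sort -n above, and top_words is a name invented for this example.

```python
import heapq


def top_words(rows, n=300):
    """Return the n (count, word) pairs with the largest counts, largest first."""
    pairs = []
    for row in rows:
        count, _, word = row.rstrip("\n").partition("\t")
        if count.isdigit() and word:  # ignore rows that do not match the assumed layout
            pairs.append((int(count), word))
    return heapq.nlargest(n, pairs)
```

Unlike sort -n piped into tail, which prints the largest counts last, this returns the most frequent word first.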
To create the word list used by These3Words, run:

    sort -n googlebooks-eng-all-1gram-20120701-*-filtered | python normalise-words.py | sort | uniq | tail -n32768 > google-ngram-list
Check that your list is long enough by counting the lines in google-ngram-list (for example with wc -l google-ngram-list); you need exactly 32768 words.
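
The length check can also be scripted. The sketch below assumes google-ngram-list holds one word per line; check_wordlist and load_words are hypothetical helpers, and the duplicate and blank-line checks are extras beyond the line count asked for above.

```python
EXPECTED_WORDS = 32768  # 2 ** 15, the count this README asks for


def check_wordlist(words, expected=EXPECTED_WORDS):
    """Return a list of problems; an empty list means the word list looks fine."""
    problems = []
    if len(words) != expected:
        problems.append(f"expected {expected} words, found {len(words)}")
    duplicates = len(words) - len(set(words))
    if duplicates:
        problems.append(f"{duplicates} duplicate entries")
    if any(not word for word in words):
        problems.append("blank lines present")
    return problems


def load_words(path="google-ngram-list"):
    """Read one word per line from path (assumed file layout)."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]
```

Running check_wordlist(load_words()) from the directory containing google-ngram-list returns an empty list when the file has exactly 32768 distinct, non-blank lines.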