Homemade BookCorpus
You can reproduce BookCorpus by yourself.
BookCorpus is a popular large-scale text corpus, especially for unsupervised learning of sentence encoders/decoders. However, BookCorpus is no longer distributed...
This repository includes a crawler collecting data from smashwords.com, which is the original source of BookCorpus.
The collected sentences may partially differ from the original corpus, but their number will be larger or almost the same. If you use the new corpus in your work, please specify that it is a replica.
How to use
Prepare URLs of available books. This repository already includes a list, url_list.jsonl, which is a snapshot I (@soskek) collected on Jan 19-20, 2019. You can use it if you'd like. To collect a fresh list instead, run:
python -u download_list.py > url_list.jsonl &
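To inspect the list before downloading, a minimal sketch like the following may help. It assumes only that url_list.jsonl is in JSON Lines format (one JSON value per line). The helper names read_jsonl and summarize are hypothetical, and no field names are assumed.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def summarize(path):
    """Return (number of entries, sorted list of keys seen in dict entries)."""
    count = 0
    keys = set()
    for entry in read_jsonl(path):
        count += 1
        if isinstance(entry, dict):
            keys.update(entry.keys())
    return count, sorted(keys)
```

Calling summarize("url_list.jsonl") shows how many books are listed and which fields each entry carries.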
Download their files. Downloading is performed for txt files if possible. Otherwise, this tries to extract text from epub. The additional argument --trash-bad-count filters out epub files whose word count differs largely from the official stat (because this may imply some failure).
python download_files.py --list url_list.jsonl --out out_txts --trash-bad-count
The results are saved into the directory given by --out (here, out_txts).
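The idea behind --trash-bad-count can be sketched as follows. This is only an illustration, not the code in download_files.py: the function name is_word_count_plausible, the whitespace-based word counting, and the tolerance of 0.5 are all assumptions.

```python
def is_word_count_plausible(text, official_count, tolerance=0.5):
    """Return True if the extracted word count is close to the official one.

    Assumptions: words are whitespace-separated, and "largely different"
    means a relative deviation above `tolerance` (a hypothetical threshold).
    """
    if official_count <= 0:
        return False  # no usable reference count to compare against
    extracted_count = len(text.split())
    relative_diff = abs(extracted_count - official_count) / official_count
    return relative_diff <= tolerance
```

A file that fails such a check would be discarded rather than written to the output directory, since a large mismatch suggests that the epub extraction went wrong.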
Postprocessing
Make a concatenated text file in sentence-per-line format.
python make_sentlines.py out_txts > all.txt
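For readers unfamiliar with the sentence-per-line format, here is a rough sketch. It does not reproduce what make_sentlines.py does internally: the regex-based splitter and the function names are assumptions for illustration only, and a real sentence splitter handles abbreviations and quotations far better.

```python
import re
from pathlib import Path

# Naive boundary: sentence-ending punctuation followed by whitespace.
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences with a naive regex (illustrative only)."""
    # Collapse line breaks and repeated spaces before splitting.
    flat = " ".join(text.split())
    return [s for s in _SENT_BOUNDARY.split(flat) if s]


def iter_sentlines(txt_dir):
    """Yield one sentence at a time from every .txt file directly in txt_dir."""
    for path in sorted(Path(txt_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        yield from split_sentences(text)
```

Printing each item of iter_sentlines("out_txts") on its own line gives a file of the same shape as all.txt, though the sentence boundaries will differ.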
If you want to tokenize them into segmented words with Microsoft's BlingFire, run the command below. You can also use another tokenizer of your choice.
python make_sentlines.py out_txts
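If you prefer a different tokenizer, the step can be swapped for something like the sketch below. It is a simple regex tokenizer built on the standard library, not BlingFire, so its output will differ from BlingFire's. The function name tokenize_line is hypothetical.

```python
import re

# A token is a run of word characters or a single punctuation character.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize_line(line):
    """Return the line with its tokens separated by single spaces."""
    return " ".join(_TOKEN.findall(line))
```

Applying tokenize_line to each line of all.txt yields a tokenized file that keeps the sentence-per-line layout.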