Homemade BookCorpus

These are scripts to reproduce BookCorpus by yourself.



You can reproduce BookCorpus by yourself.

BookCorpus is a popular large-scale text corpus, especially for unsupervised learning of sentence encoders/decoders. However, BookCorpus is no longer distributed...

This repository includes a crawler collecting data from smashwords.com, which is the original source of BookCorpus.
The collected sentences may differ in part from the original corpus, but their number should be about the same or larger. If you use the new corpus in your work, please state that it is a replica.

How to use

First, prepare the URLs of available books. This repository already includes such a list, url_list.jsonl, a snapshot that I (@soskek) collected on Jan 19-20, 2019, so you can simply use it. To collect a fresh list yourself, run:

python -u download_list.py > url_list.jsonl &
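Before downloading, it can help to check what the list actually contains. The following is a minimal sketch, not part of the repository, that only assumes url_list.jsonl is in JSON Lines format (one JSON object per line). The function name summarize_jsonl and the sample records are hypothetical; the real field names should be read from the file itself.

```python
import json
from collections import Counter


def summarize_jsonl(lines):
    """Return (record count, Counter of top-level keys) for JSON Lines input."""
    key_counts = Counter()
    n_records = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        n_records += 1
        key_counts.update(record.keys())
    return n_records, key_counts


# Hypothetical sample records; the actual keys in url_list.jsonl may differ.
sample = [
    '{"title": "Book A", "txt": "https://example.com/a.txt"}',
    '{"title": "Book B"}',
]
n_records, key_counts = summarize_jsonl(sample)
print(n_records, dict(key_counts))
```

To inspect the real list, pass an open file handle instead of the sample, for example `summarize_jsonl(open("url_list.jsonl", encoding="utf-8"))`.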

Next, download the files. A txt file is downloaded when one is available; otherwise, the script tries to extract text from the epub. The optional --trash-bad-count argument discards epub files whose extracted word count differs greatly from the book's official count, since such a mismatch may indicate a failed extraction.

python download_files.py --list url_list.jsonl --out out_txts --trash-bad-count

The results are saved in the directory given by --out (here, out_txts).
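The idea behind --trash-bad-count can be illustrated with a small sketch. This is not the code in download_files.py: the function name word_count_is_plausible, the whitespace-based word count, and the 50% tolerance are all assumptions chosen for illustration.

```python
def word_count_is_plausible(text, official_count, tolerance=0.5):
    """Return True if the extracted word count is close to the official one.

    The tolerance is a relative difference; 0.5 is an arbitrary illustrative
    value, not the threshold used by the actual script.
    """
    if official_count <= 0:
        return False  # no usable reference count to compare against
    actual_count = len(text.split())  # crude whitespace word count
    relative_diff = abs(actual_count - official_count) / official_count
    return relative_diff <= tolerance


# An extraction that yields far fewer words than advertised is rejected.
print(word_count_is_plausible("only three words", official_count=1000))
```

A relative rather than absolute difference keeps the check meaningful for both short stories and long novels.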

Postprocessing

Concatenate the downloaded texts into a single file in sentence-per-line format.

python make_sentlines.py out_txts > all.txt
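For readers unfamiliar with the sentence-per-line format, here is a rough sketch of the transformation. It uses a naive regular-expression splitter as a stand-in and makes no claim about how make_sentlines.py segments sentences; the function name to_sentence_lines is hypothetical.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
# This naive rule will mis-split abbreviations such as "Dr. Smith".
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def to_sentence_lines(text):
    """Return the sentences of text as a list, one entry per output line."""
    sentences = []
    for paragraph in text.splitlines():
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # drop blank lines between paragraphs
        sentences.extend(s for s in _SENTENCE_BOUNDARY.split(paragraph) if s)
    return sentences


print("\n".join(to_sentence_lines("Hello there. How are you?\n\nFine!")))
```

A dedicated sentence segmenter handles abbreviations and quotations far better than this rule, so the sketch is only meant to show the shape of the output.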

To tokenize the sentences into segmented words with Microsoft's BlingFire, run the command below. You can also use any other tokenizer of your choice.

python make_sentlines.py out_txts | python tokenize_sentlines.py > all.tokenized.txt
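If you would rather tokenize from your own Python code, the sketch below shows one possible approach. It assumes the blingfire package is installed and uses its text_to_words function, which returns the tokens joined by spaces; when the package is missing, it falls back to a simple regular expression that is only a rough approximation. The function name tokenize_line is hypothetical and is not taken from this repository.

```python
import re

try:
    # text_to_words returns a single string of space-separated tokens.
    from blingfire import text_to_words
except ImportError:
    text_to_words = None  # fall back to the regex below

# Fallback: runs of word characters, or single punctuation marks.
_FALLBACK_TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize_line(sentence):
    """Return the sentence as space-separated tokens."""
    if text_to_words is not None:
        return text_to_words(sentence)
    return " ".join(_FALLBACK_TOKEN.findall(sentence))


print(tokenize_line("Hello, world!"))
```

The two code paths can disagree on details such as contractions and hyphenated words, so use a single tokenizer consistently across a corpus.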

Overview

Name and owner: soskek/bookcorpus
Primary language: Python
Languages: Python (1 language)
Platforms: Linux, Mac, Windows
License: MIT License
Releases: 1
Latest release: v1.0
First release: v1.0
Created: 2018-07-14 04:46:30
Pushed: 2023-07-14 06:34:00
Last commit: 2023-07-14 15:34:00
Stars: 780
Watchers: 17
Forks: 109
Commits: 50
Issues: 15
Open issues: 5
Pull requests: 12
Open pull requests: 1
Closed pull requests: 2