Homemade BookCorpus

You can reproduce BookCorpus yourself.

BookCorpus is a popular large-scale text corpus, especially for unsupervised learning of sentence encoders/decoders. However, BookCorpus is no longer distributed...

This repository includes a crawler collecting data from smashwords.com, which is the original source of BookCorpus.
The collected sentences may differ in part from the original, but their number will be about the same or larger. If you use the new corpus in your work, please state that it is a replica.

How to use

First, prepare the URLs of available books. This repository already includes such a list as url_list.jsonl, a snapshot I (@soskek) collected on Jan 19-20, 2019, so you can use it as-is if you like.

python -u download_list.py > url_list.jsonl &
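For orientation, a .jsonl file holds one JSON object per line. The sketch below reads such lines and counts how many entries offer a txt link and how many offer only an epub link. The field names txt and epub are assumptions made for illustration; check a line of url_list.jsonl for the actual keys.

```python
import json

def count_formats(lines, txt_key="txt", epub_key="epub"):
    """Count entries offering a txt link, only an epub link, or neither.

    txt_key and epub_key are assumed field names, not confirmed ones.
    """
    counts = {"txt": 0, "epub_only": 0, "neither": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if entry.get(txt_key):
            counts["txt"] += 1
        elif entry.get(epub_key):
            counts["epub_only"] += 1
        else:
            counts["neither"] += 1
    return counts

# Made-up sample lines; a real run would iterate over the opened file.
sample = [
    '{"txt": "https://example.com/a.txt", "epub": "https://example.com/a.epub"}',
    '{"txt": "", "epub": "https://example.com/b.epub"}',
    '',
]
print(count_formats(sample))
```

Passing an open file object instead of the sample list works the same way, since both yield one line at a time.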

Next, download the files. A txt file is downloaded when one is available; otherwise, the script tries to extract text from the epub. The optional argument --trash-bad-count discards epub files whose word count differs greatly from the official figure, since a large gap may indicate a failed extraction.

python download_files.py --list url_list.jsonl --out out_txts --trash-bad-count

The results are saved in the directory given by --out (here, out_txts).
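The idea behind the word-count filter can be sketched as follows. This is not the code in download_files.py; the function name and the tolerance value are hypothetical, chosen only to show the comparison between an extracted count and the officially listed one.

```python
def count_is_plausible(text, official_count, tolerance=0.5):
    """Return True if the extracted word count is close to the official one.

    tolerance is a hypothetical relative threshold, not the script's value.
    """
    if official_count <= 0:
        return False  # no usable reference count to compare against
    extracted = len(text.split())  # whitespace-delimited word count
    return abs(extracted - official_count) / official_count <= tolerance

# A 4-word extraction passes against an official count of 4,
# but is rejected against an official count of 100.
print(count_is_plausible("one two three four", 4))
print(count_is_plausible("one two three four", 100))
```

A relative threshold is used here so that short and long books are treated alike; an absolute word difference would be another reasonable choice.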

Postprocessing

Concatenate the texts into a single file in sentence-per-line format.

python make_sentlines.py out_txts > all.txt
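To illustrate what sentence-per-line output looks like, here is a minimal sketch using a naive regular-expression splitter. It is an assumption-laden stand-in, not the method used by make_sentlines.py, and it will wrongly split after abbreviations such as "Mr.".

```python
import re

# Split at whitespace that follows ., ! or ?. Naive by design.
_SENT_END = re.compile(r"(?<=[.!?])\s+")

def to_sentlines(text):
    """Return the sentences of text, one per list item."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    return [s for s in _SENT_END.split(flat) if s]

book = "It was late.\nWho was there? Nobody\nanswered!"
print("\n".join(to_sentlines(book)))
```

Collapsing whitespace first matters because book text often wraps lines in the middle of a sentence.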

If you want to tokenize the sentences into words with Microsoft's BlingFire, run the command below. You can also substitute another tokenizer of your choice.

python make_sentlines.py out_txts
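As one example of an alternative tokenizer, the sketch below uses only the standard library. It is a rough regex-based stand-in, not BlingFire, and its output will differ from BlingFire's on contractions and other edge cases; the function name is hypothetical.

```python
import re

# A token is a run of letters, digits, or underscores; any other
# non-space character becomes its own token.
_TOKEN = re.compile(r"\w+|[^\w\s]")

def tokenize_line(sentence):
    """Return sentence with its tokens separated by single spaces."""
    return " ".join(_TOKEN.findall(sentence))

print(tokenize_line("Hello, world!"))
```

Applied to each line of a sentence-per-line file, this yields one tokenized sentence per line.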

Name and owner	soskek/bookcorpus
Primary language	Python
Languages	Python (number of languages: 1)
Platforms	Linux, Mac, Windows
License	MIT License

Created	2018-07-14 04:46:30
Pushed	2023-07-14 06:34:00
Last commit	2023-07-14 15:34:00
Releases	1
Latest release	v1.0
First release	v1.0

Stars	846
Watchers	15
Forks	109
Commits	50
Issues	15
Open issues	5
Pull requests	12
Open pull requests	1
Closed pull requests	2
