Homemade BookCorpus

You can reproduce BookCorpus by yourself.

BookCorpus is a popular large-scale text corpus, especially for unsupervised learning of sentence encoders/decoders. However, BookCorpus is no longer distributed...

This repository includes a crawler collecting data from smashwords.com, which is the original source of BookCorpus.
The collected sentences may differ in part from the original corpus, but their number should be about the same or larger. If you use the new corpus in your work, please state that it is a replica.

How to use

First, prepare the URLs of available books. This repository already includes such a list as url_list.jsonl, a snapshot that I (@soskek) collected on Jan 19-20, 2019, so you can use it as is. To collect a fresh list instead, run:

python -u download_list.py > url_list.jsonl &
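Before downloading, it can help to check what a URL list contains. The following is a minimal sketch, assuming only that the list is in JSON Lines format (one JSON object per line); the function name summarize_jsonl and the sample field names are hypothetical and not taken from this repository.

```python
import json
from collections import Counter


def summarize_jsonl(lines):
    """Count records and how often each top-level key appears.

    `lines` is any iterable of strings, e.g. an open file object.
    """
    n_records = 0
    key_counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        n_records += 1
        if isinstance(record, dict):
            key_counts.update(record.keys())
    return n_records, key_counts


# Hypothetical sample records; the real field names may differ.
sample = [
    '{"title": "Book A", "txt": "https://example.com/a.txt"}',
    '{"title": "Book B"}',
]
n, keys = summarize_jsonl(sample)
print(n, dict(keys))
```

To inspect the real list, pass an open file instead of the sample, for example `summarize_jsonl(open("url_list.jsonl", encoding="utf-8"))`.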

Next, download the files. Each book is downloaded as a txt file if one is available; otherwise, the script tries to extract the text from the epub. The optional argument --trash-bad-count discards epub files whose word count differs greatly from the official stat, since such a mismatch may indicate a failed extraction.

python download_files.py --list url_list.jsonl --out out_txts --trash-bad-count

The results are saved in the directory given by --out (here, out_txts).
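The idea behind --trash-bad-count can be sketched as a comparison between the extracted word count and the official one. This is an illustration only: the function names and the 50% tolerance are assumptions, and the rule that download_files.py actually applies may differ.

```python
def count_words(text):
    """Count whitespace-separated words."""
    return len(text.split())


def is_bad_count(text, official_count, tolerance=0.5):
    """Return True if the extracted word count deviates from the
    official count by more than `tolerance` (a fraction).

    The default tolerance of 0.5 is a hypothetical value.
    """
    if official_count <= 0:
        return False  # no usable official stat to compare against
    actual = count_words(text)
    return abs(actual - official_count) / official_count > tolerance


# A 2-word extraction against an official count of 10 is flagged.
print(is_bad_count("only two", 10))
# A 3-word extraction against an official count of 4 is kept.
print(is_bad_count("three words here", 4))
```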

Postprocessing

Make a concatenated text file in sentence-per-line format.

python make_sentlines.py out_txts > all.txt
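To show what sentence-per-line output looks like, here is a minimal sketch that splits text with a naive regular expression from the standard library. It is a stand-in for illustration, not the method used by make_sentlines.py, and the function name to_sentence_lines is hypothetical. A naive rule like this mis-splits abbreviations such as "Mr. Smith".

```python
import re

# Split at whitespace that follows sentence-ending punctuation.
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def to_sentence_lines(text):
    """Return a list of sentences, one per output line."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    if not flat:
        return []
    return [s for s in _SENT_END.split(flat) if s]


for sentence in to_sentence_lines("Hello there.  How are\nyou? Fine!"):
    print(sentence)
```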

If you want to tokenize the sentences into segmented words with Microsoft's BlingFire, run the command below. You can also use another tokenizer of your choice.

python make_sentlines.py out_txts | python tokenize_sentlines.py > all.tokenized.txt
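If you prefer another tokenizer, the step amounts to rewriting each sentence line as space-separated tokens. The sketch below uses a simple standard-library regular expression as a stand-in; it is not BlingFire and will segment text differently. The function name tokenize_line is hypothetical.

```python
import re

# A token is a run of word characters or a single punctuation mark.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize_line(line):
    """Return the line with its tokens separated by single spaces."""
    return " ".join(_TOKEN.findall(line))


print(tokenize_line("Hello, world!"))  # prints "Hello , world !"
```

Applied to a file such as all.txt, the function would be called once per line and the results written to a new file.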

Name With Owner	soskek/bookcorpus
Primary Language	Python
Program language	Python (Language Count: 1)
Platform	Linux, Mac, Windows
License:	MIT License

Created At	2018-07-14 04:46:30
Pushed At	2023-07-14 06:34:00
Last Commit At	2023-07-14 15:34:00
Release Count	1
Last Release Name	v1.0 (Posted on )
First Release Name	v1.0 (Posted on )

Stargazers Count	836
Watchers Count	16
Fork Count	111
Commits Count	50
Has Issues Enabled
Issues Count	15
Issue Open Count	5
Pull Requests Count	12
Pull Requests Open Count	1
Pull Requests Close Count	2

Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private
