couplet-dataset

A dataset of Chinese couplets.

This project fetches couplets from the blog 冯重朴_梨味斋散叶_的博客.

This dataset contains more than 700,000 couplets.

Run the spider:

scrapy runspider sina_spider.py

The spider stores the fetched data in ./output/.

A pre-fetched and cleaned dataset is also available and can be used directly with a seq2seq model. You can download it here.

The downloaded data contains 5 files:

train/in.txt: The first lines of the couplets (the model input). Each line is one input, with words separated by spaces.
train/out.txt: The second lines of the couplets (the model output). Each line is the output for the same line number in train/in.txt, with words separated by spaces.
test/in.txt: Same format as train/in.txt, but with fewer lines.
test/out.txt: Same format as train/out.txt, but with fewer lines.
vocabs: The vocabulary file. <s> and <\s> are added as the first entries; they are used when training the seq2seq model.
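
The file layout above can be read with a few lines of Python. The following is a minimal sketch, not part of this repository: the function names `read_tokens`, `load_pairs`, and `load_vocab` are hypothetical, and it assumes the files are UTF-8 encoded and that the vocabs file holds one token per line.

```python
from pathlib import Path


def read_tokens(path):
    """Return one list of tokens per line, splitting on whitespace."""
    with open(path, encoding="utf-8") as f:  # UTF-8 is an assumption
        return [line.split() for line in f]


def load_pairs(split_dir):
    """Pair line i of in.txt with line i of out.txt in a train/ or test/ directory."""
    split_dir = Path(split_dir)
    first_lines = read_tokens(split_dir / "in.txt")
    second_lines = read_tokens(split_dir / "out.txt")
    if len(first_lines) != len(second_lines):
        raise ValueError("in.txt and out.txt have different line counts")
    return list(zip(first_lines, second_lines))


def load_vocab(path):
    """Map each token to its line index (assumes one token per line)."""
    with open(path, encoding="utf-8") as f:
        tokens = [line.strip() for line in f if line.strip()]
    return {token: index for index, token in enumerate(tokens)}
```

With the downloaded data unpacked into a directory such as `couplet/` (a placeholder name), `load_pairs("couplet/train")` would return a list of `(input_tokens, output_tokens)` tuples, and `load_vocab("couplet/vocabs")` a token-to-index dictionary that could feed an embedding lookup.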

Name and owner	wb14123/couplet-dataset
Primary language	Python
Languages	Python (language count: 1)
Platform
License	GNU Affero General Public License v3.0

Created at	2018-02-24 02:31:39
Pushed at	2025-02-05 02:28:19
Last commit	2025-02-04 21:28:18

Release count