couplet-dataset

Dataset for couplets. A database of 700,000 couplets.

  • Owner: wb14123/couplet-dataset
  • License: GNU Affero General Public License v3.0

A dataset of couplets.

This project fetches couplets from 冯重朴_梨味斋散叶_的博客 (the blog of 冯重朴_梨味斋散叶).

This dataset contains more than 700,000 couplets.

Run the spider:

scrapy runspider sina_spider.py

The spider stores the fetched data in ./output/.

Download the data

An already fetched and cleaned dataset is available and can be used directly with the seq2seq model. You can download it here.

The downloaded data contains 5 files:

  1. train/in.txt: The first lines of the couplets, used as model input. Each line is one input, with words separated by spaces.
  2. train/out.txt: The second lines of the couplets, used as model output. Each line is the output for the same line number in train/in.txt, with words separated by spaces.
  3. test/in.txt: Same format as train/in.txt, but with less data.
  4. test/out.txt: Same format as train/out.txt, but with less data.
  5. vocabs: The vocabulary file. <s> and <\s> are added as the first entries; they are used when training the seq2seq model.
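
The file layout above can be read with a few lines of Python. This is only a minimal sketch, not part of the project: the function name load_couplets is hypothetical, and UTF-8 encoding is an assumption about the downloaded files.

```python
from pathlib import Path


def load_couplets(split_dir):
    """Read in.txt and out.txt from one split directory (e.g. train/ or test/).

    Returns a list of (input_tokens, output_tokens) pairs, one per couplet.
    Assumption: the files are UTF-8 encoded.
    """
    split_dir = Path(split_dir)
    with open(split_dir / "in.txt", encoding="utf-8") as f_in, \
         open(split_dir / "out.txt", encoding="utf-8") as f_out:
        # Each line is one half of a couplet; words are separated by spaces.
        inputs = [line.split() for line in f_in]
        outputs = [line.split() for line in f_out]

    # Line i of out.txt answers line i of in.txt, so the counts must match.
    if len(inputs) != len(outputs):
        raise ValueError("in.txt and out.txt have different line counts")
    return list(zip(inputs, outputs))
```

For example, if the download were extracted into a directory named couplet (an assumed name), load_couplets("couplet/train") would return the training pairs, and the same call with "couplet/test" would return the test pairs.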

Main metrics

Overview
Name With Owner: wb14123/couplet-dataset
Primary Language: Python
Programming Language: Python (Language Count: 1)
License: GNU Affero General Public License v3.0
Owner Activity
Created At: 2018-02-24 02:31:39
Pushed At: 2025-02-05 02:28:19
Last Commit At: 2025-02-04 21:28:18
Release Count: 1
Last Release Name: 1.0
First Release Name: 1.0
User Engagement
Stargazers Count: 739
Watchers Count: 18
Fork Count: 221
Commits Count: 11
Issues Count: 3
Issue Open Count: 3
Pull Requests Count: 0
Pull Requests Open Count: 0
Pull Requests Closed Count: 0