couplet-dataset

Dataset for couplets: a database of 700,000 couplets.

  • Owner: wb14123/couplet-dataset
  • License: GNU Affero General Public License v3.0

Couplet dataset.

This project fetches couplets from the blog 冯重朴_梨味斋散叶_的博客.

This dataset contains more than 700,000 couplets.

Run the spider:

scrapy runspider sina_spider.py

The spider stores the fetched data in ./output/.

Download the data

A fetched and cleaned dataset is already available and can be used directly with the seq2seq model. You can download it here.

The downloaded data contains 5 files:

  1. train/in.txt: The inputs (first lines) of the couplets. Each line is one input, with words separated by spaces.
  2. train/out.txt: The outputs (second lines) of the couplets. Each line is the output for the same line number in train/in.txt, with words separated by spaces.
  3. test/in.txt: Same format as train/in.txt, but with fewer lines.
  4. test/out.txt: Same format as train/out.txt, but with fewer lines.
  5. vocabs: The vocabulary file. <s> and <\s> are added as the first entries; they are used when training the seq2seq model.
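
The file layout above can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the function names load_pairs and load_vocab are hypothetical, and it assumes the files are UTF-8 encoded and that vocabs holds one token per line (neither is stated above).

```python
from pathlib import Path
from typing import Dict, List, Tuple


def _split_words(line: str) -> List[str]:
    # Words are separated by single spaces; drop empty strings left by
    # trailing or doubled spaces.
    return [w for w in line.rstrip("\r\n").split(" ") if w]


def load_pairs(split_dir: str) -> List[Tuple[List[str], List[str]]]:
    """Pair line i of in.txt with line i of out.txt for one split.

    Assumption: both files are UTF-8 encoded.
    """
    base = Path(split_dir)
    with open(base / "in.txt", encoding="utf-8") as f_in:
        in_lines = [_split_words(line) for line in f_in]
    with open(base / "out.txt", encoding="utf-8") as f_out:
        out_lines = [_split_words(line) for line in f_out]
    if len(in_lines) != len(out_lines):
        raise ValueError(
            f"in.txt has {len(in_lines)} lines, out.txt has {len(out_lines)}"
        )
    return list(zip(in_lines, out_lines))


def load_vocab(vocab_path: str) -> Dict[str, int]:
    """Map each token to its line index.

    Assumption: the vocabs file holds one token per line, so the two
    special tokens listed first receive ids 0 and 1.
    """
    with open(vocab_path, encoding="utf-8") as f:
        tokens = [line.rstrip("\r\n") for line in f]
    # Skip blank lines but keep the original ordering of the rest.
    tokens = [t for t in tokens if t]
    return {token: index for index, token in enumerate(tokens)}
```

Calling load_pairs("train") from the directory containing the downloaded data would return a list of (input words, output words) tuples, and load_vocab("vocabs") a token-to-id dictionary. The length check is there because the pairing relies entirely on matching line numbers between the two files.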

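Main metrics

Overview
Name and owner: wb14123/couplet-dataset
Primary programming language: Python
Programming languages: Python (number of languages: 1)
License: GNU Affero General Public License v3.0
Owner activity
Created: 2018-02-24 02:31:39
Last push: 2025-02-05 02:28:19
Last commit: 2025-02-04 21:28:18
Releases: 1
Latest release: 1.0
First release: 1.0
User engagement
Stars: 739
Watchers: 18
Forks: 221
Commits: 11
Issues: 3
Open issues: 3
Pull requests: 0
Open pull requests: 0
Closed pull requests: 0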