WikiPlots

A dataset containing story plots from Wikipedia (books, movies, etc.) and the code for the extractor.

  • Owner: markriedl/WikiPlots

WikiPlots

The WikiPlots corpus is a collection of 112,936 story plots extracted from English-language Wikipedia. The stories are taken from any English-language article with a sub-header containing the word "plot" (e.g., "Plot", "Plot Summary").
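
As a rough illustration of this selection rule, the sketch below scans an article's HTML for headers whose text contains "plot" and collects the text that follows each one. It uses only Python's standard library and is not the code in wikiPlots.py. The function name extract_plot_sections, the assumption that section headers appear as <h2> to <h4> tags, and the sample article are all hypothetical.

```python
import re

# Assumption: section headers appear as <h2>..<h4> tags in the article HTML.
HEADER_RE = re.compile(r"<h([2-4])[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extract_plot_sections(html):
    """Return the text under every header whose title contains 'plot'."""
    sections = []
    headers = list(HEADER_RE.finditer(html))
    for i, match in enumerate(headers):
        title = TAG_RE.sub("", match.group(2)).strip()
        if "plot" not in title.lower():
            continue
        # The section body runs from the end of this header to the next header.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(html)
        body = TAG_RE.sub(" ", html[match.end():end])
        sections.append(" ".join(body.split()))
    return sections


# Hypothetical article used only to exercise the function.
sample = (
    "<h2>Cast</h2><p>Someone.</p>"
    "<h2>Plot summary</h2><p>A hero leaves home.</p><p>The hero returns.</p>"
    "<h2>Reception</h2><p>Well reviewed.</p>"
)
print(extract_plot_sections(sample))
```

Only the section under "Plot summary" is returned; the "Cast" and "Reception" sections are skipped because their headers do not contain the word "plot".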

This repository contains code and instructions for how to recreate the WikiPlots corpus.

The dataset itself can be downloaded from here: plots.zip (updated: 09/26/2017). The zip file contains two files:

  • plots: a text file containing all story plots. Each story plot is given with one sentence per line. Each story is followed by <EOS> on a line by itself.
  • titles: a text file containing the title of each article in which a story plot was found and extracted.
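
The layout of the two files can be read back with a few lines of Python. This is a minimal sketch, not part of the repository: the function name read_stories is hypothetical, and pairing stories with titles assumes that the titles file lists one title per line in the same order as the stories, which is not stated explicitly here.

```python
import io


def read_stories(plot_lines):
    """Yield each story as a list of sentences, splitting on <EOS> lines."""
    sentences = []
    for line in plot_lines:
        line = line.strip()
        if line == "<EOS>":
            yield sentences
            sentences = []
        elif line:
            sentences.append(line)
    if sentences:
        # Tolerate a final story that is missing its <EOS> marker.
        yield sentences


# Tiny in-memory stand-ins for the real "plots" and "titles" files.
plots = io.StringIO(
    "A hero leaves home.\nThe hero returns.\n<EOS>\nA ship sinks.\n<EOS>\n"
)
titles = io.StringIO("First Story\nSecond Story\n")

stories = list(read_stories(plots))
# Assumption: titles are one per line, in the same order as the stories.
paired = dict(zip((t.strip() for t in titles), stories))
print(paired)
```

For the real corpus, the two io.StringIO objects would be replaced by the opened plots and titles files.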

Using the code to recreate the corpus

I have also included the Python script used to extract the story plots.

wikiPlots.py requires:

  • wikiextractor
  • BeautifulSoup4

To use wikiPlots.py:

  1. Download an English Wikipedia dump. From this link you will find a file named something like "enwiki-20170401-pages-articles-multistream.xml.bz2". Make sure you download the .bz2 file that is not the index file.
  2. Decompress the .bz2 file to extract the .xml file.
  3. Download wikiextractor. You do not need to set it up. Run it as follows:

python wikiextractor.py -o output_directory --json --html -s enwiki-...xml

You must run wikiextractor.py with these parameters: wikiPlots.py requires JSON files with nested HTML and with section header information preserved. Wikiextractor will produce a number of subfolders named "AA", "AB", "AC", and so on. Each folder will contain wiki_xx files holding JSON records, one per article.
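
To make the directory layout concrete, here is a hedged sketch of walking the wikiextractor output and decoding its records. It assumes that each line of a wiki_xx file is one JSON object with "title" and "text" fields, which is not spelled out here. iter_articles is a hypothetical helper, not a function from wikiPlots.py.

```python
import json
import os
import tempfile


def iter_articles(dump_dir):
    """Yield one dict per article from every wiki_xx file under dump_dir."""
    for folder in sorted(os.listdir(dump_dir)):
        folder_path = os.path.join(dump_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        for name in sorted(os.listdir(folder_path)):
            if not name.startswith("wiki_"):
                continue
            with open(os.path.join(folder_path, name), encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        # Assumption: one JSON record per line.
                        yield json.loads(line)


# Build a tiny fake dump so the sketch runs without a real Wikipedia download.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "AA"))
record = {"title": "Example", "text": "<h2>Plot</h2>Things happen."}
with open(os.path.join(demo_dir, "AA", "wiki_00"), "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")

articles = list(iter_articles(demo_dir))
print(articles[0]["title"])
```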

  4. Install the BeautifulSoup4 Python package.
  5. Download and run wikiPlots.py from this repository:

python wikiPlots.py wiki_dump_directory plot_file_name title_file_name

wiki_dump_directory should be the path to the directory containing the "AA", "AB", etc. folders. plot_file_name will be the name of the file that will contain the story plots. title_file_name will be the name of the file that will contain the list of story titles.
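
For reference, the two output files could be produced along the following lines. This is an illustrative sketch of the file format only: write_corpus and its arguments are hypothetical, the sentences are assumed to have been split already, and writing one title per line is likewise an assumption.

```python
import os
import tempfile


def write_corpus(stories, plot_file_name, title_file_name):
    """Write (title, sentences) pairs in the plots/titles layout."""
    with open(plot_file_name, "w", encoding="utf-8") as plots, \
         open(title_file_name, "w", encoding="utf-8") as titles:
        for title, sentences in stories:
            for sentence in sentences:
                plots.write(sentence + "\n")   # one sentence per line
            plots.write("<EOS>\n")             # story terminator on its own line
            titles.write(title + "\n")         # assumption: one title per line


# Write a one-story corpus to a temporary directory and read it back.
out_dir = tempfile.mkdtemp()
plot_path = os.path.join(out_dir, "plots")
title_path = os.path.join(out_dir, "titles")
write_corpus(
    [("Example", ["A hero leaves home.", "The hero returns."])],
    plot_path,
    title_path,
)

with open(plot_path, encoding="utf-8") as f:
    plot_text = f.read()
with open(title_path, encoding="utf-8") as f:
    title_text = f.read()
print(plot_text)
```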

Key metrics

Overview
Name and owner: markriedl/WikiPlots
Primary language: Python
Languages: Python (language count: 1)

Owner activity
Created: 2017-04-19 19:18:01
Pushed: 2017-09-26 13:58:46
Last commit: 2017-09-26 09:58:45
Releases: 0

User engagement
Stars: 315
Watchers: 12
Forks: 33
Commits: 22
Issues: 6
Open issues: 5
Pull requests: 1
Open pull requests: 0
Closed pull requests: 0