OpenNRE-PyTorch

Neural Relation Extraction implemented in PyTorch

Main metrics

Overview
Name With Owner: ShulinCao/OpenNRE-PyTorch
Primary Language: Python
Program language: Python (Language Count: 1)
Platform: Linux, Mac, Windows
License: MIT License
Owner activity
Created At: 2018-08-06 05:26:48
Pushed At: 2018-11-15 02:27:38
Last Commit At: 2018-11-15 10:27:37
Release Count: 0
User engagement
Stargazers Count: 219
Watchers Count: 6
Fork Count: 45
Commits Count: 10
Has Issues Enabled
Issues Count: 27
Issue Open Count: 14
Pull Requests Count: 0
Pull Requests Open Count: 0
Pull Requests Close Count: 0
Project settings
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private

OpenNRE-PyTorch

An open-source framework for neural relation extraction implemented in PyTorch.

Contributed by Shulin Cao, Tianyu Gao, Xu Han, Lumin Tang, Yankai Lin, Zhiyuan Liu

Overview

It is a PyTorch-based framework for easily building relation extraction models. We divide the relation extraction pipeline into four parts: embedding, encoder, selector, and classifier. For each part we have implemented several methods.

  • Embedding
    • Word embedding
    • Position embedding
    • Concatenation method
  • Encoder
    • PCNN
    • CNN
  • Selector
    • Attention
    • Maximum
    • Average
  • Classifier
    • Softmax loss function
    • Output

All of these methods can be combined freely.
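To make the role of the selector concrete, here is a minimal pure-Python sketch of three ways to reduce a bag of sentence vectors (all sentences mentioning the same entity pair) to one vector. It is not the package's code: the function names, the `query` vector, and the reading of "Maximum" as "keep the single best-scoring sentence" are assumptions made for illustration, and Python 3 is assumed.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def average_selector(bag):
    """Average the sentence vectors in the bag, dimension by dimension."""
    return [sum(col) / len(bag) for col in zip(*bag)]


def max_selector(bag, query):
    """Keep only the sentence that scores highest against the query vector
    (an assumed reading of the "Maximum" selector)."""
    return max(bag, key=lambda sent: dot(sent, query))


def attention_selector(bag, query):
    """Sum the sentence vectors, weighted by softmax(dot(sentence, query))."""
    scores = [dot(sent, query) for sent in bag]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * sent[d] for w, sent in zip(weights, bag))
            for d in range(len(bag[0]))]


bag = [[1.0, 0.0], [0.0, 1.0]]          # two toy sentence vectors
print(average_selector(bag))             # [0.5, 0.5]
print(max_selector(bag, [0.0, 1.0]))     # [0.0, 1.0]
```

Because each selector takes a bag and returns one vector, any of them can sit between the same encoder and classifier, which is what makes the parts interchangeable.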

We also provide fast training and testing code. You can change hyperparameters or specify the model architecture through command-line arguments. A plotting method is also included in the package.

This project is released under the MIT license.

Requirements

  • Python (>=2.7)
  • PyTorch (==0.3.1)
  • CUDA (>=8.0)
  • Matplotlib (>=2.0.0)
  • scikit-learn (>=0.18)

Installation

  1. Install PyTorch
  2. Clone the OpenNRE repository:
git clone https://github.com/ShulinCao/OpenNRE-PyTorch
  3. Download the NYT dataset from Google Drive
  4. Extract the dataset to ./raw_data:
unzip raw_data.zip

Dataset

NYT10 Dataset

NYT10 is a distantly supervised dataset originally released with the paper "Modeling relations and their mentions without labeled text" by Sebastian Riedel, Limin Yao, and Andrew McCallum. The original data is available for download, and the NYT10 dataset used here can be downloaded from Google Drive. The data details are as follows.

Training Data & Testing Data

The training and testing data files, which contain sentences with their corresponding entity pairs and relations, should be in the following format:

[
    {
        'sentence': 'Bill Gates is the founder of Microsoft .',
        'head': {'word': 'Bill Gates', 'id': 'm.03_3d', ...(other information)},
        'tail': {'word': 'Microsoft', 'id': 'm.07dfk', ...(other information)},
        'relation': 'founder'
    },
    ...
]

IMPORTANT: In the sentence field, words and punctuation marks must be separated by spaces.
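The spacing requirement matters because a sentence in this format can be tokenized by splitting on whitespace. The sketch below is an illustration only and does not reproduce what gen_data.py does: the helper names `find_span` and `relative_positions` are hypothetical, and Python 3 is assumed. It locates the head and tail entities in a record and computes the kind of relative offsets a position embedding could be indexed by.

```python
def find_span(tokens, entity):
    """Return (start, end) token indices of the first match of entity, or None."""
    ent = entity.split()
    for i in range(len(tokens) - len(ent) + 1):
        if tokens[i:i + len(ent)] == ent:
            return i, i + len(ent)
    return None


def relative_positions(num_tokens, anchor):
    """Offset of every token index from the anchor token index."""
    return [i - anchor for i in range(num_tokens)]


record = {
    "sentence": "Bill Gates is the founder of Microsoft .",
    "head": {"word": "Bill Gates", "id": "m.03_3d"},
    "tail": {"word": "Microsoft", "id": "m.07dfk"},
    "relation": "founder",
}

tokens = record["sentence"].split()
head_span = find_span(tokens, record["head"]["word"])
tail_span = find_span(tokens, record["tail"]["word"])
print(head_span, tail_span)                           # (0, 2) (6, 7)
print(relative_positions(len(tokens), tail_span[0]))  # [-6, -5, -4, -3, -2, -1, 0, 1]

# Without the space before the period, the last token is "Microsoft."
# and the entity can no longer be matched token by token.
print(find_span("Bill Gates founded Microsoft.".split(), "Microsoft"))  # None
```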

Word Embedding Data

Word embedding data is used to initialize the word embeddings in the networks and should be in the following format:

[
    {'word': 'the', 'vec': [0.418, 0.24968, ...]},
    {'word': ',', 'vec': [0.013441, 0.23682, ...]},
    ...
]
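As a rough illustration of how a file in this format can be consumed, the sketch below turns the list of entries into a word-to-index map and an embedding matrix. The function name `build_vocab`, the extra `UNK` and `BLANK` rows, and the three-dimensional toy vectors are assumptions for this example, not values or behaviour taken from the package; Python 3 is assumed.

```python
def build_vocab(embedding_data, extra_tokens=("UNK", "BLANK")):
    """Build (word2id, matrix) from a list of {'word': ..., 'vec': [...]} entries.

    Rows of zeros are appended for extra_tokens (hypothetical special tokens
    for unknown words and padding).
    """
    word2id = {}
    matrix = []
    dim = len(embedding_data[0]["vec"])
    for entry in embedding_data:
        if len(entry["vec"]) != dim:
            raise ValueError("inconsistent vector size for %r" % entry["word"])
        word2id[entry["word"]] = len(matrix)
        matrix.append(list(entry["vec"]))
    for token in extra_tokens:
        word2id[token] = len(matrix)
        matrix.append([0.0] * dim)
    return word2id, matrix


# Toy data with made-up three-dimensional vectors.
toy = [
    {"word": "the", "vec": [0.1, 0.2, 0.3]},
    {"word": ",", "vec": [0.4, 0.5, 0.6]},
]
word2id, matrix = build_vocab(toy)
print(word2id)      # {'the': 0, ',': 1, 'UNK': 2, 'BLANK': 3}
print(len(matrix))  # 4
```

The resulting matrix has one row per word, in the order of the IDs, so it can be copied directly into an embedding layer's weights.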

Relation-ID Mapping Data

This file maps each relation to an ID so that the same ID refers to the same relation across training and testing runs. Its format is as follows:

{
    'NA': 0,
    'relation_1': 1,
    'relation_2': 2,
    ...
}

IMPORTANT: Make sure the ID of NA is always 0.
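A quick sanity check on a mapping in this format can catch mistakes before training. The sketch below is a hypothetical helper (`check_rel2id` is not part of the package). Only the `NA` rule comes from the text above; the requirement that IDs be unique and contiguous from 0 is an added assumption, motivated by IDs typically being used as classifier output indices.

```python
def check_rel2id(rel2id):
    """Raise ValueError if the relation-ID mapping looks inconsistent."""
    if rel2id.get("NA") != 0:
        raise ValueError("the ID of NA must be 0")
    ids = sorted(rel2id.values())
    # Assumption: IDs are used as class indices, so they should be 0..n-1.
    if ids != list(range(len(ids))):
        raise ValueError("IDs should be unique and contiguous, starting at 0")
    return True


print(check_rel2id({"NA": 0, "relation_1": 1, "relation_2": 2}))  # True
```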

Quick Start

Process Data

python gen_data.py

The processed data will be stored in ./data.

Train Model

python train.py --model_name pcnn_att

The model_name argument specifies the model architecture, and pcnn_att is the name of one of our models. All available models are in ./models. For the other arguments, please refer to ./train.py. Once training starts, all checkpoints are stored in ./checkpoint.
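For readers unfamiliar with this pattern, the sketch below shows one common way a flag such as `--model_name` can select a model class. It is not the code in ./train.py: the registry, the placeholder classes, and the `my_model` entry are hypothetical, and only the flag name and the value `pcnn_att` come from the command above.

```python
import argparse


class PCNN_ATT:
    """Placeholder standing in for a real model class."""
    def __init__(self, config):
        self.config = config


class MyModel:
    """Second placeholder, to show that several names can be registered."""
    def __init__(self, config):
        self.config = config


# Hypothetical registry from command-line names to model classes.
MODEL_REGISTRY = {"pcnn_att": PCNN_ATT, "my_model": MyModel}


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, default="pcnn_att",
                        choices=sorted(MODEL_REGISTRY),
                        help="name of the model architecture to use")
    return parser.parse_args(argv)


def build_model(args, config):
    """Instantiate the class registered under args.model_name."""
    return MODEL_REGISTRY[args.model_name](config)


args = parse_args(["--model_name", "pcnn_att"])
model = build_model(args, config={})
print(type(model).__name__)  # PCNN_ATT
```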

Test Model

python test.py --model_name pcnn_att

Usage is the same as for training. When testing finishes, the PR-curve data for the best checkpoint will be stored in ./test_result.

Plot

python draw_plot.py PCNN_ATT

The plot will be saved as ./test_result/pr_curve.png. You can specify several models in the arguments, e.g. python draw_plot.py PCNN_ATT PCNN_ONE PCNN_AVE, as long as results for these models exist in ./test_result.
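As background on what a precision-recall curve contains, here is a minimal sketch that computes its points from scored predictions. The input format (a list of `(score, is_correct)` pairs) and the function name `pr_curve` are assumptions for illustration; the format of the files in ./test_result is not described here and may differ.

```python
def pr_curve(scored):
    """Return (recall, precision) points from (score, is_correct) pairs.

    Predictions are ranked by descending score; the k-th point describes
    the top-k predictions.
    """
    total_positive = sum(1 for _, ok in scored if ok)
    if total_positive == 0:
        raise ValueError("at least one correct prediction is required")
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    points = []
    correct = 0
    for k, (_, ok) in enumerate(ranked, start=1):
        if ok:
            correct += 1
        points.append((correct / total_positive, correct / k))
    return points


toy = [(0.9, True), (0.8, False), (0.7, True), (0.1, False)]
for recall, precision in pr_curve(toy):
    print(round(recall, 3), round(precision, 3))
# 0.5 1.0
# 0.5 0.5
# 1.0 0.667
# 1.0 0.5
```

Plotting these points with recall on the x-axis and precision on the y-axis gives one curve per model, which is how several models can be compared in a single figure.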

Build Your Own Model

Besides training and testing the existing models in our package, you can also build your own model or add methods to the four basic modules. To add a new model, create a Python file in ./models with the same name as the model and implement it as follows:

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

from networks.embedding import *
from networks.encoder import *
from networks.selector import *
from networks.classifier import *
from .Model import Model


class PCNN_ATT(Model):
    def __init__(self, config):
        super(PCNN_ATT, self).__init__(config)
        # Sentence encoder: piecewise CNN
        self.encoder = PCNN(config)
        # Bag-level selector: attention over encoder outputs of size
        # hidden_size * 3
        self.selector = Attention(config, config.hidden_size * 3)

Then you can train, test and plot!
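The `config.hidden_size * 3` passed to the selector reflects how a piecewise CNN pools its features: the sentence is split into three segments by the two entity positions, and each filter is max-pooled once per segment. The pure-Python sketch below illustrates that pooling step only. It is not the package's PCNN implementation: the function name, the segment boundaries (each entity token closes its segment), and the 0.0 used for an empty segment are assumptions, and Python 3 is assumed.

```python
def piecewise_max_pool(features, head_pos, tail_pos):
    """Max-pool per-token filter outputs over three entity-delimited segments.

    features: one list of filter activations per token.
    Returns a flat list of 3 * num_filters values.
    """
    first, second = sorted((head_pos, tail_pos))
    segments = [
        features[:first + 1],            # start of sentence .. first entity
        features[first + 1:second + 1],  # after first entity .. second entity
        features[second + 1:],           # after second entity .. end
    ]
    num_filters = len(features[0])
    pooled = []
    for segment in segments:
        for f in range(num_filters):
            # An empty segment contributes 0.0 (assumption for this sketch).
            pooled.append(max((token[f] for token in segment), default=0.0))
    return pooled


# Four tokens, two filters; entities at token 0 and token 2.
features = [[1.0, 5.0], [2.0, 4.0], [3.0, 3.0], [4.0, 2.0]]
pooled = piecewise_max_pool(features, head_pos=0, tail_pos=2)
print(pooled)       # [1.0, 5.0, 3.0, 4.0, 4.0, 2.0]
print(len(pooled))  # 6, i.e. 3 * num_filters
```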