BERT-pytorch

Pytorch implementation of Google AI's 2018 BERT, with simple annotation

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
Paper URL : https://arxiv.org/abs/1810.04805

Introduction

Google AI's BERT paper shows amazing results on various NLP tasks (a new state of the art on 17 NLP tasks),
including outperforming the human F1 score on the SQuAD v1.1 QA task.
The paper demonstrates that a Transformer (self-attention) based encoder can be a powerful
alternative to previous language models, given a proper language model training method.
More importantly, it shows that this pre-trained language model can be transferred
to any NLP task without building a task-specific model architecture.

This amazing result will be recorded in NLP history,
and I expect many further papers about BERT to be published very soon.

This repo is an implementation of BERT. The code is very simple and easy to understand quickly.
Some of the code is based on The Annotated Transformer.

This project is currently a work in progress, and the code is not verified yet.

Installation

pip install bert-pytorch

Quickstart

NOTICE: Your corpus should be prepared with two sentences per line, separated by a tab (\t).

0. Prepare your corpus

Welcome to the \t the jungle\n
I can stay \t here all night\n

or an already-tokenized corpus (tokenization is not included in the package):

Wel_ _come _to _the \t _the _jungle\n
_I _can _stay \t _here _all _night\n
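
As a hedged illustration (the file name and sentence pairs below are just examples, not part of the package), a corpus file in this format could be produced like this:

# Illustrative only: write pairs of consecutive sentences, one pair per line,
# separated by a tab, into the corpus file used by the commands below.
pairs = [
    ("Welcome to the", "the jungle"),
    ("I can stay", "here all night"),
]

with open("data/corpus.small", "w", encoding="utf-8") as f:
    for first, second in pairs:
        f.write(f"{first}\t{second}\n")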

1. Building vocab based on your corpus

bert-vocab -c data/corpus.small -o data/vocab.small
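
For intuition only, vocabulary building roughly amounts to counting the tokens in the corpus and assigning each one an id. The sketch below is an illustrative stand-in, not the actual bert-vocab implementation; the function name and the special-token list are assumptions.

from collections import Counter

# Rough sketch of vocabulary building (illustrative, not the real bert-vocab code):
# count whitespace tokens in the tab-separated corpus and map each token to an id.
def build_vocab(corpus_path, min_freq=1):
    counter = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            for sentence in line.rstrip("\n").split("\t"):
                counter.update(sentence.split())
    specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # assumed special tokens
    tokens = [t for t, c in counter.most_common() if c >= min_freq]
    return {token: idx for idx, token in enumerate(specials + tokens)}

vocab = build_vocab("data/corpus.small")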

2. Train your own BERT model

bert -c data/corpus.small -v data/vocab.small -o output/bert.model

Language Model Pre-training

In the paper, the authors present two new language model training methods:
"masked language model" and "predict next sentence".

Masked Language Model

Original Paper : 3.3.1 Task #1: Masked LM

Input Sequence  : The man went to [MASK] store with [MASK] dog
Target Sequence :                  the                his

Rules:

15% of the input tokens are randomly selected and changed, based on the following sub-rules (see the sketch after this list):

  1. 80% of the selected tokens are replaced with the [MASK] token
  2. 10% of the selected tokens are replaced with a [RANDOM] token (another word)
  3. 10% of the selected tokens remain unchanged, but still need to be predicted
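
A minimal sketch of these masking rules (illustrative only; function and variable names are assumptions, not the repository's actual code):

import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    # tokens: list of word-piece strings; vocab: list of candidate replacement tokens
    inputs, targets = [], []
    for token in tokens:
        if random.random() < mask_prob:          # select ~15% of the input tokens
            targets.append(token)                # the model must predict the original token
            dice = random.random()
            if dice < 0.8:                       # 80%: replace with [MASK]
                inputs.append(mask_token)
            elif dice < 0.9:                     # 10%: replace with a random token
                inputs.append(random.choice(vocab))
            else:                                # 10%: keep the original token
                inputs.append(token)
        else:
            inputs.append(token)
            targets.append(None)                 # not selected: nothing to predict
    return inputs, targets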

Predict Next Sentence

Original Paper : 3.3.2 Task #2: Next Sentence Prediction

Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label : IsNext

Input : [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label : NotNext

"Is this sentence can be continuously connected?"

understanding the relationship, between two text sentences, which is
not directly captured by language modeling

Rules (see the sketch after this list):

  1. 50% of the time, the second sentence is the actual next sentence (a continuous sentence).
  2. 50% of the time, the second sentence is an unrelated sentence picked at random.
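
A minimal sketch of how such sentence pairs can be built (illustrative only; names are assumptions, not the repository's actual code):

import random

def make_nsp_example(index, sentence_pairs):
    # sentence_pairs: list of (sentence_a, sentence_b) tuples where sentence_b
    # actually follows sentence_a in the corpus
    sentence_a, next_sentence = sentence_pairs[index]
    if random.random() < 0.5:
        return sentence_a, next_sentence, 1      # label 1 = IsNext
    else:
        random_sentence = random.choice(sentence_pairs)[1]
        return sentence_a, random_sentence, 0    # label 0 = NotNext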

Author

Junseong Kim, Scatter Lab (codertimo@gmail.com / junseong.kim@scatterlab.co.kr)

License

This project follows the Apache 2.0 License, as written in the LICENSE file.

Copyright 2018 Junseong Kim, Scatter Lab, respective BERT contributors

Copyright (c) 2018 Alexander Rush : The Annotated Transformer

Key Metrics

Overview
Name and owner: codertimo/BERT-pytorch
Primary language: Python
Languages: Python (language count: 2)
Platform:
License: Apache License 2.0

Owner activity
Created: 2018-10-15 12:58:15
Last push: 2023-09-15 12:57:08
Last commit: 2018-10-30 16:42:26
Releases: 5
Latest release: 0.0.1a4 (released on )
First release: 0.0.1a0 (released on )

User engagement
Stars: 6.4k
Watchers: 125
Forks: 1.3k
Commits: 64
Issues enabled?
Issues: 87
Open issues: 56
Pull requests: 9
Open pull requests: 12
Closed pull requests: 1

Project settings
Wiki enabled?
Archived?
Is a fork?
Locked?
Is a mirror?
Is private?