Google AI 2018 BERT PyTorch implementation

BERT-pytorch


PyTorch implementation of Google AI's 2018 BERT, with simple annotations

BERT 2018: Pre-training of Deep Bidirectional Transformers for Language Understanding. Paper URL: https://arxiv.org/abs/1810.04805

Introduction

Google AI's BERT paper shows amazing results on various NLP tasks (new state of the art on eleven NLP tasks), including outperforming the human F1 score on the SQuAD v1.1 QA task. The paper demonstrates that a Transformer (self-attention) based encoder, with a proper language-model training method, can be a powerful alternative to previous language models. More importantly, it shows that this pre-trained language model can be transferred to any NLP task without a task-specific model architecture.

This amazing result will be a milestone in NLP history, and I expect many follow-up papers about BERT to be published very soon.

This repo is an implementation of BERT. The code is very simple and quick to understand. Some of the code is based on The Annotated Transformer.

This project is currently a work in progress, and the code has not been verified yet.

Installation

pip install bert-pytorch

Quickstart

NOTICE: Your corpus should be prepared with two sentences per line, separated by a tab (\t).

0. Prepare your corpus

Welcome to the \t the jungle\n
I can stay \t here all night\n

or a tokenized corpus (tokenization is not included in this package)

Wel_ _come _to _the \t _the _jungle\n
_I _can _stay \t _here _all _night\n

1. Build a vocab based on your corpus

bert-vocab -c data/corpus.small -o data/vocab.small

2. Train your own BERT model

bert -c data/corpus.small -v data/vocab.small -o output/bert.model
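After training, the saved model can presumably be loaded back with plain torch.load. A minimal sketch, assuming the output path above (the trainer may append an epoch suffix such as .ep0, depending on the version):

    import torch

    # hypothetical path; the trainer may save per-epoch files like output/bert.model.ep0
    bert = torch.load("output/bert.model")
    bert.eval()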

Language Model Pre-training

In the paper, the authors present two new language-model training methods: the "masked language model" and "next sentence prediction".

Masked Language Model

Original Paper : 3.3.1 Task #1: Masked LM

Input Sequence  : The man went to [MASK] store with [MASK] dog
Target Sequence :                  the                his

Rules:

Randomly, 15% of the input tokens are changed according to the sub-rules below (a sketch follows the list):

  1. 80% of the time, the token is replaced with the [MASK] token.
  2. 10% of the time, the token is replaced with a random token (another word).
  3. 10% of the time, the token is left unchanged, but still has to be predicted.
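
A minimal sketch of this masking procedure, assuming integer token ids and hypothetical values for the [MASK] id and vocabulary size (the actual implementation lives in dataset/dataset.py):

    import random

    MASK_ID = 4          # hypothetical id of the [MASK] token
    VOCAB_SIZE = 30000   # hypothetical vocabulary size

    def mask_tokens(token_ids):
        """Apply the 15% selection and 80/10/10 replacement rules."""
        inputs, labels = [], []
        for tok in token_ids:
            if random.random() < 0.15:       # select 15% of tokens
                labels.append(tok)           # model must predict the original token
                dice = random.random()
                if dice < 0.8:               # 80%: replace with [MASK]
                    inputs.append(MASK_ID)
                elif dice < 0.9:             # 10%: replace with a random token
                    inputs.append(random.randrange(VOCAB_SIZE))
                else:                        # 10%: keep the token unchanged
                    inputs.append(tok)
            else:
                inputs.append(tok)
                labels.append(0)             # 0 = position ignored by the loss
        return inputs, labels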

Predict Next Sentence

Original Paper : 3.3.2 Task #2: Next Sentence Prediction

Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label : IsNext

Input : [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label : NotNext

"Is this sentence can be continuously connected?"

understanding the relationship, between two text sentences, which is not directly captured by language modeling

Rules (a sketch follows the list):

  1. 50% of the time, the second sentence is the actual next (continuous) sentence.
  2. 50% of the time, the second sentence is a random, unrelated sentence.
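
A minimal sketch of this pairing rule, assuming the corpus has already been loaded as a list of (first sentence, second sentence) tuples:

    import random

    def make_sentence_pair(lines, index):
        """Build one NSP training example; label 1 = IsNext, 0 = NotNext."""
        sent_a, sent_b = lines[index]
        if random.random() < 0.5:
            return sent_a, sent_b, 1                         # 50%: the real next sentence
        random_b = lines[random.randrange(len(lines))][1]    # 50%: a random sentence
        return sent_a, random_b, 0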

Author

Junseong Kim, Scatter Lab ([email protected] / [email protected])

License

This project follows the Apache 2.0 License, as written in the LICENSE file.

Copyright 2018 Junseong Kim, Scatter Lab, respective BERT contributors

Copyright (c) 2018 Alexander Rush : The Annotated Transformer

Owner
Junseong Kim
Scatter Lab, Machine Learning Research Scientist, NLP
Comments
  • Very low GPU usage when training on 8 GPUs in a single machine

    Hi, I am currently pretraining BERT on my own data, using the alpha0.0.1a5 branch (the newest version).
    I found that only about 20% of each GPU is in use.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:3F:00.0 Off |                    0 |
    | N/A   40C    P0    58W / 300W |  10296MiB / 16152MiB |     32%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000000:40:00.0 Off |                    0 |
    | N/A   37C    P0    55W / 300W |   2742MiB / 16152MiB |     23%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000000:41:00.0 Off |                    0 |
    | N/A   40C    P0    58W / 300W |   2742MiB / 16152MiB |      1%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000000:42:00.0 Off |                    0 |
    | N/A   47C    P0    61W / 300W |   2742MiB / 16152MiB |     24%      Default |
    +-------------------------------+----------------------+----------------------+
    |   4  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
    | N/A   36C    P0    98W / 300W |   2742MiB / 16152MiB |     17%      Default |
    +-------------------------------+----------------------+----------------------+
    |   5  Tesla V100-SXM2...  On   | 00000000:63:00.0 Off |                    0 |
    | N/A   38C    P0    88W / 300W |   2736MiB / 16152MiB |     23%      Default |
    +-------------------------------+----------------------+----------------------+
    |   6  Tesla V100-SXM2...  On   | 00000000:64:00.0 Off |                    0 |
    | N/A   48C    P0    80W / 300W |   2736MiB / 16152MiB |     25%      Default |
    +-------------------------------+----------------------+----------------------+
    |   7  Tesla V100-SXM2...  On   | 00000000:65:00.0 Off |                    0 |
    | N/A   46C    P0    71W / 300W |   2736MiB / 16152MiB |     24%      Default |
    +-------------------------------+----------------------+----------------------+
    

    I am not familiar with PyTorch. Does anyone know why?

  • Example of Input Data

    Could you give a concrete example of the input data? You gave an example of the corpus data, but not of the dataset.small file used in this line:

    bert -c data/dataset.small -v data/vocab.small -o output/bert.model

    If you could show perhaps a couple of examples, that would be very helpful! I am new to PyTorch, so the DataLoader function is a little confusing.

  • Why doesn't the counter in data_iter increase?

    I am currently playing around with training and testing the model. However, while implementing the test section, I noticed that during LM training your counter doesn't increase when looping over data_iter in pretrain.py. Wouldn't this cause problems when calculating the average loss/accuracy?


  • model/embedding/position.py

    The line

        div_term = (torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)).float().exp()

    should be:

        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()

    Because the arange tensor is integer-typed, the multiplication is truncated to zeros before the cast, so exp() returns all ones:

        In [51]: (torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)).float().exp()
        Out[51]: tensor([1., 1., 1., ..., 1.])   # 64 ones
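
    With the cast applied first, the exponents are no longer truncated to zero. A quick check (d_model assumed to be 128):

        import math, torch
        d_model = 128
        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()
        # div_term now decays smoothly from 1.0 down to roughly 1e-4, instead of being all ones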

    Additional question: I don't quite understand how the "bidirectional" Transformer from the original paper is implemented. Is it like a BiLSTM, concatenating the outputs of two directional Transformers? I didn't find a similar structure in your code.

  • The LayerNorm implementation

    I am wondering why you don't use the standard nn version of LayerNorm. I notice the difference is in the denominator: nn.LayerNorm uses sqrt(variance + epsilon), while yours uses (standard deviation + epsilon).

    Could you clarify the difference between these two approaches?
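
    For reference, a sketch of the two variants being compared (not the repo's exact code):

        import torch

        def layernorm_std_eps(x, eps=1e-6):
            # repo-style denominator: (standard deviation + eps)
            mean = x.mean(-1, keepdim=True)
            std = x.std(-1, keepdim=True)
            return (x - mean) / (std + eps)

        def layernorm_var_eps(x, eps=1e-5):
            # nn.LayerNorm-style denominator: sqrt(variance + eps)
            mean = x.mean(-1, keepdim=True)
            var = x.var(-1, unbiased=False, keepdim=True)
            return (x - mean) / torch.sqrt(var + eps)

    The two agree closely for well-scaled activations; they differ most when the variance is near zero.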

  • How to embed the segment label

    Thanks for your code, which let me learn more details of this paper. But I can't understand segment.py. You haven't written how the segment label is embedded.
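
    For context, a segment embedding is typically just a small lookup table over the segment labels. A sketch, assuming labels 0 = padding, 1 = sentence A, 2 = sentence B, and a hidden size of 512:

        import torch
        import torch.nn as nn

        segment_embedding = nn.Embedding(3, 512, padding_idx=0)  # 3 labels, hidden size assumed
        segment_label = torch.tensor([[1, 1, 1, 2, 2, 0]])       # A A A B B <pad>
        seg_vectors = segment_embedding(segment_label)           # shape (1, 6, 512)
        # these vectors are added to the token and position embeddings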

  • A question about the learning-rate implementation

    Nice implementation! However, I have a question about the learning rate. The learning-rate schedule from the original Transformer uses warm-up, but your implementation is just a simple decay. Could you add the warm-up schedule to your BERT code?
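
    For reference, a sketch of the warm-up schedule from the original Transformer paper (step counted from 1; d_model and warmup_steps values here are illustrative):

        def noam_lr(step, d_model=768, warmup_steps=10000):
            # linear warm-up for warmup_steps, then inverse-square-root decay
            return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)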

  • Question about random sampling.


    https://github.com/codertimo/BERT-pytorch/blob/7efd2b5a631f18ebc83cd16886b8c6ee77a40750/bert_pytorch/dataset/dataset.py#L50-L64

    Well, it seems random.random() always returns a non-negative number, so prob >= prob * 0.9 will always be true?
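
    A quick check of that claim (random.random() returns values in [0.0, 1.0)):

        >>> import random
        >>> prob = random.random()
        >>> prob >= prob * 0.9   # holds for every prob >= 0, including 0.0
        True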

  • Masked language model loss

    Hi, thank you for your clean BERT code. After reading it, I have a question about the masked LM loss: your program computes the masked language model loss on both positive sentence pairs and negative pairs.

    Does it make sense to compute the masked LM loss on negative sentence pairs? I am not sure how Google computes this loss.

  • Imbalanced GPU memory usage

    Hi,

    Nice attempt at a BERT implementation.

    I tried running your code on 4x V100s and found that the memory usage is imbalanced: the first GPU consumes twice as much memory as the others. Any idea about the reason?

    Btw, I think the parameter order in train.py line 64 is incorrect.

  • [BERT] Cannot import bert


    I have problems importing bert when following http://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html

    (mxnet_p36) [[email protected] ~]$ ipython
    Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)
    Type 'copyright', 'credits' or 'license' for more information
    IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.
    
    In [1]: import warnings
       ...: warnings.filterwarnings('ignore')
       ...:
       ...: import random
       ...: import numpy as np
       ...: import mxnet as mx
       ...: from mxnet import gluon
       ...: import gluonnlp as nlp
       ...:
       ...:
    
    
    In [2]:
    
    In [2]: np.random.seed(100)
       ...: random.seed(100)
       ...: mx.random.seed(10000)
       ...: ctx = mx.gpu(0)
       ...:
       ...:
    
    In [3]: from bert import *
       ...:
    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input-3-40b999f3ea6a> in <module>()
    ----> 1 from bert import *
    
    ModuleNotFoundError: No module named 'bert'
    

    It looks like gluonnlp is successfully installed. Any idea?

    (mxnet_p36) [[email protected] site-packages]$ ll /ec2-user-anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluonnlp-0.5.0.post0-py3.6.egg
    -rw-rw-r-- 1 ec2-user ec2-user 499320 Dec 28 23:15 /ec2-user-anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluonnlp-0.5.0.post0-py3.6.egg
    
  • Why specify `ignore_index=0` in the NLLLoss function in BERTTrainer?


    trainer/pretrain.py

    class BERTTrainer:
        def __init__(self, ...):
            ... 
            # Using Negative Log Likelihood Loss function for predicting the masked_token
            self.criterion = nn.NLLLoss(ignore_index=0)
            ...
    

    I cannot understand why ignore_index=0 is specified when calculating the NLLLoss. If the ground truth of is_next is False (label = 0) for the NSP task but BERT predicts True, then the NLLLoss will be 0 (or nan)... so what is the aim of ignore_index=0?

    ====================

    Well, I've found that ignore_index=0 is useful for the MLM task, but I still can't agree that the NSP task should share the same NLLLoss with the MLM task.
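
    A sketch of the clash being described (values and shapes are illustrative only):

        import torch
        import torch.nn as nn

        criterion = nn.NLLLoss(ignore_index=0)

        # MLM: label 0 marks unmasked/padding positions, which are correctly skipped
        mlm_log_probs = torch.log_softmax(torch.randn(4, 10), dim=-1)
        mlm_target = torch.tensor([0, 7, 0, 3])       # only positions 1 and 3 contribute
        mlm_loss = criterion(mlm_log_probs, mlm_target)

        # NSP: label 0 means "NotNext", but it is silently ignored as well
        nsp_log_probs = torch.log_softmax(torch.randn(1, 2), dim=-1)
        nsp_target = torch.tensor([0])                # contributes no loss at all
        nsp_loss = criterion(nsp_log_probs, nsp_target)  # nan: every target is ignored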

  • Added a Google Colab Notebook that contains all the code in this project.


    For learning purposes, I added example.ipynb, a Google Colab notebook that works right out of the box. I have also included an example data file that addresses #59 .

  • It keeps trying to use CUDA despite --with_cuda False option


    Hello,

    I have tried running bert with --with_cuda False, but the model keeps running the "forward" function on CUDA. Below are my command line and the error message I got.

    bert -c corpus.small -v vocab.small -o bert.model --with_cuda False -e 5

    Loading Vocab vocab.small
    Vocab Size: 262
    Loading Train Dataset corpus.small
    Loading Dataset: 113it [00:00, 560232.09it/s]
    Loading Test Dataset None
    Creating Dataloader
    Building BERT model
    Creating BERT Trainer
    Total Parameters: 6453768
    Training Start
    EP_train:0: 0%| | 0/2 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "/home/yuni/anaconda3/envs/py3/bin/bert", line 8, in <module>
        sys.exit(train())
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/__main__.py", line 67, in train
        trainer.train(epoch)
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/trainer/pretrain.py", line 69, in train
        self.iteration(epoch, self.train_data)
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/trainer/pretrain.py", line 102, in iteration
        next_sent_output, mask_lm_output = self.model.forward(data["bert_input"], data["segment_label"])
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/language_model.py", line 24, in forward
        x = self.bert(x, segment_label)
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/bert.py", line 46, in forward
        x = transformer.forward(x, mask)
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/transformer.py", line 29, in forward
        x = self.input_sublayer(x, lambda _x: self.attention.forward(_x, _x, _x, mask=mask))
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/utils/sublayer.py", line 18, in forward
        return x + self.dropout(sublayer(self.norm(x)))
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/transformer.py", line 29, in <lambda>
        x = self.input_sublayer(x, lambda _x: self.attention.forward(_x, _x, _x, mask=mask))
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/attention/multi_head.py", line 32, in forward
        x, attn = self.attention(query, key, value, mask=mask, dropout=self.dropout)
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/attention/single.py", line 25, in forward
        return torch.matmul(p_attn, value), p_attn
    RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 1.95 GiB total capacity; 309.18 MiB already allocated; 125.62 MiB free; 312.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

  • Does dataset/dataset.py have an error?

        def get_random_line(self):
            if self.on_memory:
                return self.lines[random.randrange(len(self.lines))][1]

    This code is meant to fetch an incorrect next sentence (isNotNext: 0), but it may randomly pick the line that is actually the next sentence (isNext: 1).
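
    A sketch of one possible fix, resampling until the random line differs from the actual next sentence (self.t2 is a hypothetical attribute holding the true next sentence, not the repo's code):

        def get_random_line(self):
            # keep sampling until the negative example is not the true next sentence
            while True:
                line = self.lines[random.randrange(len(self.lines))][1]
                if line != self.t2:
                    return line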
