BPEmb

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

Usage

Install BPEmb with pip:

pip install bpemb

Embeddings and SentencePiece models will be downloaded automatically the first time you use them.

>>> from bpemb import BPEmb
# load English BPEmb model with default vocabulary size (10k) and 50-dimensional embeddings
>>> bpemb_en = BPEmb(lang="en", dim=50)
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d50.w2v.bin.tar.gz

You can do two main things with BPEmb. The first is subword segmentation:

# apply English BPE subword segmentation model
>>> bpemb_en.encode("Stratford")
['▁strat', 'ford']
# load Chinese BPEmb model with vocabulary size 100k and default (100-dim) embeddings
>>> bpemb_zh = BPEmb(lang="zh", vs=100000)
# apply Chinese BPE subword segmentation model
>>> bpemb_zh.encode("这是一个中文句子")  # "This is a Chinese sentence."
['▁这是一个', '中文', '句子']  # ["This is a", "Chinese", "sentence"]
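
The "▁" character in the pieces above is the SentencePiece word-boundary marker. As a minimal sketch (the helper name pieces_to_text is hypothetical and not part of the bpemb package), pieces can be joined back into text by concatenating them and turning the marker back into a space. Note that the English example was lowercased during encoding, so the original casing is not recovered.

```python
def pieces_to_text(pieces):
    """Join subword pieces and turn the '▁' word-boundary marker back into spaces."""
    return "".join(pieces).replace("▁", " ").strip()


# Pieces copied from the segmentation examples above.
print(pieces_to_text(["▁strat", "ford"]))
print(pieces_to_text(["▁这是一个", "中文", "句子"]))
```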

Whether and how a word gets split depends on the vocabulary size. Generally, a smaller vocabulary size yields a segmentation into many short subwords, while a larger vocabulary size leaves frequent words unsplit:

vocabulary size    segmentation
1000               ['▁str', 'at', 'f', 'ord']
3000               ['▁str', 'at', 'ford']
5000               ['▁str', 'at', 'ford']
10000              ['▁strat', 'ford']
25000              ['▁stratford']
50000              ['▁stratford']
100000             ['▁stratford']
200000             ['▁stratford']
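
The effect in the table can be reproduced with a toy BPE learner. This is only a sketch under stated assumptions: it is not the SentencePiece implementation behind BPEmb, the word frequencies are made up, and the function names are hypothetical. The number of merge operations plays the role of the vocabulary size: the more merges are learned, the fewer pieces a frequent word is split into.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} dict."""
    # Every word starts as a sequence of characters, prefixed with the '▁' marker.
    corpus = {tuple("▁" + word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:  # every word is already a single symbol
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = {tuple(apply_merge(list(s), best)): f for s, f in corpus.items()}
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = list("▁" + word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Made-up toy corpus statistics, for illustration only.
word_freqs = {"stratford": 5, "strategy": 3, "oxford": 4, "ford": 6}
for num_merges in (0, 5, 10, 50):
    print(num_merges, segment("stratford", learn_merges(word_freqs, num_merges)))
```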

The second purpose of BPEmb is to provide pretrained subword embeddings:

# Embeddings are wrapped in a gensim KeyedVectors object
>>> type(bpemb_zh.emb)
gensim.models.keyedvectors.Word2VecKeyedVectors
# You can use BPEmb objects like gensim KeyedVectors
>>> bpemb_en.most_similar("ford")
[('bury', 0.8745079040527344),
 ('ton', 0.8725000619888306),
 ('well', 0.871537446975708),
 ('ston', 0.8701574206352234),
 ('worth', 0.8672043085098267),
 ('field', 0.859795331954956),
 ('ley', 0.8591548204421997),
 ('ington', 0.8126075267791748),
 ('bridge', 0.8099068999290466),
 ('brook', 0.7979353070259094)]
>>> type(bpemb_en.vectors)
numpy.ndarray
>>> bpemb_en.vectors.shape
(10000, 50)
>>> bpemb_zh.vectors.shape
(100000, 100)
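
The most_similar call shown above ranks subwords by cosine similarity between their vectors. The following sketch illustrates that ranking on a tiny, made-up embedding matrix. The vocabulary, the vector values, and the standalone most_similar function are assumptions for illustration; they are not the gensim implementation and not real BPEmb vectors.

```python
import numpy as np


def most_similar(query, vocab, vectors, topn=3):
    """Rank vocabulary entries by cosine similarity to the query's vector."""
    # Normalise rows to unit length so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(query)]
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != query][:topn]


# Hypothetical 3-dimensional vectors standing in for bpemb_en.vectors.
vocab = ["ford", "bury", "ton", "the"]
vectors = np.array([
    [1.0, 0.2, 0.0],
    [0.9, 0.3, 0.1],
    [0.8, 0.1, 0.3],
    [-0.5, 1.0, 0.2],
])
print(most_similar("ford", vocab, vectors, topn=2))
```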

To use subword embeddings in your neural network, either encode your input into subword IDs:

>>> ids = bpemb_zh.encode_ids("这是一个中文句子")
>>> ids
[25950, 695, 20199]
>>> bpemb_zh.vectors[ids].shape
(3, 100)

Or use the embed method:

# apply Chinese subword segmentation and perform embedding lookup
>>> bpemb_zh.embed("这是一个中文句子").shape
(3, 100)
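
Neural models usually consume batches of equal-length sequences, so the per-sentence lookup above is typically followed by padding. The sketch below shows one way to do this with NumPy. The random matrix is a stand-in for bpemb_zh.vectors, the subword IDs are made up, and embed_batch is a hypothetical helper rather than a function of the bpemb package.

```python
import numpy as np


def embed_batch(id_seqs, vectors):
    """Look up vectors for several ID sequences and zero-pad them to equal length."""
    max_len = max(len(ids) for ids in id_seqs)
    batch = np.zeros((len(id_seqs), max_len, vectors.shape[1]), dtype=vectors.dtype)
    mask = np.zeros((len(id_seqs), max_len), dtype=bool)
    for row, ids in enumerate(id_seqs):
        batch[row, :len(ids)] = vectors[ids]  # same lookup as vectors[ids] above
        mask[row, :len(ids)] = True           # True marks real (non-padding) positions
    return batch, mask


rng = np.random.default_rng(0)
# Random stand-in for an embedding matrix of shape (vocabulary size, dimension).
vectors = rng.normal(size=(1000, 100)).astype(np.float32)
# Made-up subword IDs for two sentences of different length.
batch, mask = embed_batch([[25, 695, 199], [7, 42]], vectors)
print(batch.shape, mask.sum(axis=1))
```

The boolean mask lets a downstream model ignore padded positions, for example when averaging subword vectors into a sentence vector.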

Downloads for each language

ab (Abkhazian), ace (Achinese), ady (Adyghe), af (Afrikaans), ak (Akan), als (Alemannic), am (Amharic), an (Aragonese), ang (Old English), ar (Arabic), arc (Official Aramaic), arz (Egyptian Arabic), as (Assamese), ast (Asturian), atj (Atikamekw), av (Avaric), ay (Aymara), az (Azerbaijani), azb (South Azerbaijani)

ba (Bashkir), bar (Bavarian), bcl (Central Bikol), be (Belarusian), bg (Bulgarian), bi (Bislama), bjn (Banjar), bm (Bambara), bn (Bengali), bo (Tibetan), bpy (Bishnupriya), br (Breton), bs (Bosnian), bug (Buginese), bxr (Russia Buriat)

ca (Catalan), cdo (Min Dong Chinese), ce (Chechen), ceb (Cebuano), ch (Chamorro), chr (Cherokee), chy (Cheyenne), ckb (Central Kurdish), co (Corsican), cr (Cree), crh (Crimean Tatar), cs (Czech), csb (Kashubian), cu (Church Slavic), cv (Chuvash), cy (Welsh)

da (Danish), de (German), din (Dinka), diq (Dimli), dsb (Lower Sorbian), dty (Dotyali), dv (Dhivehi), dz (Dzongkha)

ee (Ewe), el (Modern Greek), en (English), eo (Esperanto), es (Spanish), et (Estonian), eu (Basque), ext (Extremaduran)

fa (Persian), ff (Fulah), fi (Finnish), fj (Fijian), fo (Faroese), fr (French), frp (Arpitan), frr (Northern Frisian), fur (Friulian), fy (Western Frisian)

ga (Irish), gag (Gagauz), gan (Gan Chinese), gd (Scottish Gaelic), gl (Galician), glk (Gilaki), gn (Guarani), gom (Goan Konkani), got (Gothic), gu (Gujarati), gv (Manx)

ha (Hausa), hak (Hakka Chinese), haw (Hawaiian), he (Hebrew), hi (Hindi), hif (Fiji Hindi), hr (Croatian), hsb (Upper Sorbian), ht (Haitian), hu (Hungarian), hy (Armenian)

ia (Interlingua), id (Indonesian), ie (Interlingue), ig (Igbo), ik (Inupiaq), ilo (Iloko), io (Ido), is (Icelandic), it (Italian), iu (Inuktitut)

ja (Japanese), jam (Jamaican Creole English), jbo (Lojban), jv (Javanese)

ka (Georgian), kaa (Kara-Kalpak), kab (Kabyle), kbd (Kabardian), kbp (Kabiyè), kg (Kongo), ki (Kikuyu), kk (Kazakh), kl (Kalaallisut), km (Central Khmer), kn (Kannada), ko (Korean), koi (Komi-Permyak), krc (Karachay-Balkar), ks (Kashmiri), ksh (Kölsch), ku (Kurdish), kv (Komi), kw (Cornish), ky (Kirghiz)

la (Latin), lad (Ladino), lb (Luxembourgish), lbe (Lak), lez (Lezghian), lg (Ganda), li (Limburgan), lij (Ligurian), lmo (Lombard), ln (Lingala), lo (Lao), lrc (Northern Luri), lt (Lithuanian), ltg (Latgalian), lv (Latvian)

mai (Maithili), mdf (Moksha), mg (Malagasy), mh (Marshallese), mhr (Eastern Mari), mi (Maori), min (Minangkabau), mk (Macedonian), ml (Malayalam), mn (Mongolian), mr (Marathi), mrj (Western Mari), ms (Malay), mt (Maltese), mwl (Mirandese), my (Burmese), myv (Erzya), mzn (Mazanderani)

na (Nauru), nap (Neapolitan), nds (Low German), ne (Nepali), new (Newari), ng (Ndonga), nl (Dutch), nn (Norwegian Nynorsk), no (Norwegian), nov (Novial), nrm (Narom), nso (Pedi), nv (Navajo), ny (Nyanja)

oc (Occitan), olo (Livvi), om (Oromo), or (Oriya), os (Ossetian)

pa (Panjabi), pag (Pangasinan), pam (Pampanga), pap (Papiamento), pcd (Picard), pdc (Pennsylvania German), pfl (Pfaelzisch), pi (Pali), pih (Pitcairn-Norfolk), pl (Polish), pms (Piemontese), pnb (Western Panjabi), pnt (Pontic), ps (Pushto), pt (Portuguese)

qu (Quechua)

rm (Romansh), rmy (Vlax Romani), rn (Rundi), ro (Romanian), ru (Russian), rue (Rusyn), rw (Kinyarwanda)

sa (Sanskrit), sah (Yakut), sc (Sardinian), scn (Sicilian), sco (Scots), sd (Sindhi), se (Northern Sami), sg (Sango), sh (Serbo-Croatian), si (Sinhala), sk (Slovak), sl (Slovenian), sm (Samoan), sn (Shona), so (Somali), sq (Albanian), sr (Serbian), srn (Sranan Tongo), ss (Swati), st (Southern Sotho), stq (Saterfriesisch), su (Sundanese), sv (Swedish), sw (Swahili), szl (Silesian)

ta (Tamil), tcy (Tulu), te (Telugu), tet (Tetum), tg (Tajik), th (Thai), ti (Tigrinya), tk (Turkmen), tl (Tagalog), tn (Tswana), to (Tonga), tpi (Tok Pisin), tr (Turkish), ts (Tsonga), tt (Tatar), tum (Tumbuka), tw (Twi), ty (Tahitian), tyv (Tuvinian)

udm (Udmurt), ug (Uighur), uk (Ukrainian), ur (Urdu), uz (Uzbek)

ve (Venda), vec (Venetian), vep (Veps), vi (Vietnamese), vls (Vlaams), vo (Volapük)

wa (Walloon), war (Waray), wo (Wolof), wuu (Wu Chinese)

xal (Kalmyk), xh (Xhosa), xmf (Mingrelian)

yi (Yiddish), yo (Yoruba)

za (Zhuang), zea (Zeeuws), zh (Chinese), zu (Zulu)

MultiBPEmb

multi (multilingual)

Citing BPEmb

If you use BPEmb in academic work, please cite:

@InProceedings{heinzerling2018bpemb,
  author = {Benjamin Heinzerling and Michael Strube},
  title = "{BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
  }
Comments
  • adding special tokens to a BPEmb model

    Hi,

    Thanks for this excellent resource! I've been using BPEmb in my models since learning about it recently and have found it to work quite well. I'm currently trying to figure out how to use it most effectively with my data, which is pre-processed with certain masking tokens for privacy, e.g. <name>, <digit>, etc.

    This might be an obvious question, but can you think of a way to extend the vocabulary by adding special tokens to a pre-trained sentencepiece model, or is this out of the question? If it is not possible, perhaps future iterations of BPEmb could allow for a certain number of arbitrary special tokens.

    Thanks in advance!

  • Symbols don't match between model and embedding

    It seems that symbols from the sentencepiece model and the associated embeddings are not the same, specifically control symbols are not present in the embedding. For instance, in de.wiki.bpe.op50000.model there are the symbols <s>, </s> and <unk>, but in the associated embedding de.wiki.bpe.vs50000.d300.vectors.bin only <unk> is defined.

    Update: Further exploration reveals that the number of symbols common to the two aforementioned files is 49631.

  • Vocab length != word vector count

    Hey, there is an inconsistency between the count of words in the vocabulary file of the English word vectors for 25000 BPE merges and the count of word vectors with a dimensionality of 200. In the vocab file, there is a count of 25000 words, but the .bin file contains 25777 word vectors after loading it into gensim.

    Additionally, the order of the word vectors differs from the vocab file. For example, the word "▁explanation" is at position 19387 in the vocab file, but at position 9138 in the .bin file.

    It would be very helpful if the vocabulary had the same order and the same count in all files.

  • Some embeddings are invalid (majority of vectors is inf or nan)

    First, thanks for your efforts in providing the pretrained embeddings.

    Unfortunately, some of the embeddings were not trained correctly. For instance, the d300 embeddings for the 10k model for English contain 9640 vectors with inf entries, out of 10817 total. It would be great if you could provide your training script, or double-check it and upload fixed vectors. The d100 embeddings that you use in the Readme are indeed fine and do not contain any inf values.

    Furthermore, while debugging this issue I noticed that the embeddings contain Chinese characters at the following indices: [10345, 10451, 10458, 10475, 10514, 10531, 10539, 10541, 10601, 10606, 10609, 10622, 10627, 10632, 10633, 10638, 10657, 10702, 10740, 10750, 10755, 10756, 10762, 10781, 10790, 10791, 10802, 10809, 10810, 10815]. Perhaps it would be sensible to filter out sentences containing Chinese characters from the training corpus?

  • EOFError: Compressed file ended before the end-of-stream marker was reached

    Dear @all,

    I'm trying to load the German BPEmb model (lang="de") with vocabulary size 50k and 100-dimensional embeddings.

    bpemb_de = BPEmb(lang="de", vs=50000)

    I got an EOFError:

    EOFError Traceback (most recent call last)
    in
          1 import bpemb
          2 from bpemb import BPEmb
    ----> 3 bpemb_de = BPEmb(lang="de", vs=50000)

    ~/anaconda3/lib/python3.6/site-packages/bpemb/bpemb.py in __init__(self, lang, vs, dim, cache_dir, preprocess, encode_extra_options, add_pad_emb, vs_fallback, segmentation_only, model_file, emb_file)
        188 else:
        189 emb_file = self.emb_tpl.format(lang=lang, vs=vs, dim=dim)
    --> 190 self.emb_file = self._load_file(emb_file, archive=True)
        191 self.emb = load_word2vec_file(self.emb_file, add_pad=add_pad_emb)
        192 self.most_similar = self.emb.most_similar

    ~/anaconda3/lib/python3.6/site-packages/bpemb/bpemb.py in _load_file(self, file, archive, cache_dir)
        226 file_url = self.base_url + file + suffix
        227 print("downloading", file_url)
    --> 228 return http_get(file_url, cached_file, ignore_tardir=True)
        229
        230 def __repr__(self):

    ~/anaconda3/lib/python3.6/site-packages/bpemb/util.py in http_get(url, outfile, ignore_tardir)
         47 import tarfile
         48 tf = tarfile.open(fileobj=temp_file)
    ---> 49 members = tf.getmembers()
         50 if len(members) != 1:
         51 raise NotImplementedError("TODO: extract multiple files")

    ~/anaconda3/lib/python3.6/tarfile.py in getmembers(self)
       1759 self._check()
       1760 if not self._loaded: # if we want to obtain a list of
    -> 1761 self._load() # all members, we first have to
       1762 # scan the whole archive.
       1763 return self.members

    ~/anaconda3/lib/python3.6/tarfile.py in _load(self)
       2356 """
       2357 while True:
    -> 2358 tarinfo = self.next()
       2359 if tarinfo is None:
       2360 break

    ~/anaconda3/lib/python3.6/tarfile.py in next(self)
       2287 # Advance the file pointer.
       2288 if self.offset != self.fileobj.tell():
    -> 2289 self.fileobj.seek(self.offset - 1)
       2290 if not self.fileobj.read(1):
       2291 raise ReadError("unexpected end of data")

    ~/anaconda3/lib/python3.6/gzip.py in seek(self, offset, whence)
        366 elif self.mode == READ:
        367 self._check_not_closed()
    --> 368 return self._buffer.seek(offset, whence)
        369
        370 return self.offset

    ~/anaconda3/lib/python3.6/_compression.py in seek(self, offset, whence)
        141 # Read and discard data until we reach the desired position.
        142 while offset > 0:
    --> 143 data = self.read(min(io.DEFAULT_BUFFER_SIZE, offset))
        144 if not data:
        145 break

    ~/anaconda3/lib/python3.6/gzip.py in read(self, size)
        480 break
        481 if buf == b"":
    --> 482 raise EOFError("Compressed file ended before the "
        483 "end-of-stream marker was reached")
        484

    EOFError: Compressed file ended before the end-of-stream marker was reached

    Any suggestions for fixing this issue would be appreciated!

  • question on https://nlp.h-its.org

    Hi,

    Now that issue https://github.com/bheinzerling/bpemb/issues/34 is sorted out, I'm planning to add a simple function to the R wrapper of sentencepiece at https://github.com/bnosac/sentencepiece that downloads the models you have been providing, and then to put the package on CRAN. Before I do this, I would like to know the intentions behind the site https://nlp.h-its.org, namely:

    • Who maintains that site?
    • Will that site persist over time?
    • Are there any objections to redistributing these sentencepiece models and GloVe embeddings, license-wise or otherwise?

    Thanks for any input.

  • version of sentencepiece used

    Hi, I'm writing an R wrapper around sentencepiece and tried to load a few of the models and vocabularies provided here. I have found some inconsistencies. To make sure this is not related to the version of SentencePiece you used, can you let me know which version or commit of SentencePiece the models were built with? I'm building the R wrapper around sentencepiece release v0.1.84 from Oct 12, 2019.

  • Adding support for own models

    Hi,

    First of all, thanks for the great package. Currently, the only way to use my own models with bpemb is to first load another model, and then assign the .spm and .emb attributes manually. This is a bit unwieldy.

    I am interested in adding a subclass of BPEmb that overrides the __init__ of BPEmb and simply accepts paths to an spm and emb model/file, from which the other attributes (e.g. size/vs) are derived. Is this something you would accept as a PR? Do you see any problems with this approach?

    Thanks! Stéphan

  • How do you get the embedding/id for the pad token ?

    Hi, this may be a dumb question, but when creating a BPEmb with add_pad_emb=True, how do I actually get the padding embedding, and what is its ID? This should perhaps be mentioned somewhere in the docs:

    from bpemb import BPEmb
    bp = BPEmb(lang="en", vs=1000, dim=50, add_pad_emb=True)
    print(bp.vs) # prints 1000, was expecting 1000+1 ?
    

    Thanks for the great work, derlin

  • numbers/digits conversion

    Hi there, thanks for providing such wonderful pretrained models. I noticed that the pretrained models convert all digits to 0. May I ask why? Numbers are very useful and sometimes crucial for downstream NLP tasks such as question answering (e.g. when a question involves numbers).

  • Missing tokens in German model

    For the German models, there seem to be tokens missing in the vector models (200k merge operations) that are present in the vocab list and sentencepiece models. For example, ▁plaisor and auens are in the vocab but fail to look up in the model.

    The number of tokens in the model, len(model.index2word), is 185712, while the vocab list has 200k entries.

    Am I missing something here, or is there some mismatch between the training of the sentencepiece tokens and the GloVe vectors?

    Great resource otherwise, thanks!

  • Truecase supported.

    I'm working on a machine translation task. When I encode my corpus with bpemb, the output is always lowercase. Is it possible to retain case information when encoding my corpus?

  • Train --model_type=unigram

    Thank you for making this resource.

    It would be nice to be able to use models trained with --model_type=unigram, which is the default mode of sentencepiece.

    BPE and unigram show basically the same BLEU score, but unigram is more flexible and supports subword sampling. https://arxiv.org/abs/1804.10959
