Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:

Title: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Authors: Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi

CodeT5 demo

Updates

Sep 24, 2021

CodeT5 is now on Hugging Face!

You can simply load the models (CodeT5-small and CodeT5-base) and run inference:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate one code span
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
# this prints "{user.username}"

Introduction

This repo provides the code for reproducing the experiments in CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on 8.35M functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves state-of-the-art results on 14 sub-tasks in a code intelligence benchmark - CodeXGLUE.

Paper link: https://arxiv.org/abs/2109.00859

Blog link: https://blog.einstein.ai/codet5/

The code currently includes two pre-trained checkpoints (CodeT5-small and CodeT5-base) and scripts to fine-tune them on 4 generation tasks (code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and clone detection) in CodeXGLUE.

In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers. At Salesforce, we built an AI coding assistant demo that uses CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:

  • Text-to-code generation: generate code based on a natural language description.
  • Code autocompletion: complete a whole function given the target function name.
  • Code summarization: generate a natural language summary of a function.

Table of Contents

  1. Citation
  2. License
  3. Dependency
  4. Download
  5. Fine-tuning
  6. Get Involved

Citation

If you find this code to be useful for your research, please consider citing:

@inproceedings{
    wang2021codet5,
    title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation}, 
    author={Yue Wang and Weishi Wang and Shafiq Joty and Steven C.H. Hoi},
    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
    year={2021},
}

License

The code is released under the BSD-3 License (see LICENSE.txt for details), but we also ask that users respect the following:

This software should not be used to promote or profit from:

  • violence, hate, and division,
  • environmental destruction,
  • abuse of human rights, or
  • the destruction of people's physical and mental health.

We encourage users of this software to tell us about the applications in which they are putting it to use by emailing [email protected], and to use appropriate documentation when developing high-stakes applications of this model.

Dependency

  • PyTorch 1.7.1
  • tensorboard 2.4.1
  • transformers 4.6.1
  • tree-sitter 0.2.2

Download

Instructions to download:

pip install gsutil

gsutil -m cp -r "gs://sfr-codet5-data-research/data/" .

mkdir pretrained_models; cd pretrained_models
gsutil -m cp -r \
  "gs://sfr-codet5-data-research/pretrained_models/codet5_small" \
  "gs://sfr-codet5-data-research/pretrained_models/codet5_base" \
  .

The repository structure will look like the following after the download:

├── CODE_OF_CONDUCT.md
├── README.md
├── SECURITY.md
├── codet5.gif
├── configs.py
├── models.py
├── run_clone.py
├── run_gen.py
├── utils.py
├── _utils.py
├── LICENSE.txt
├── data
│   ├── clone
│   ├── concode
│   ├── defect
│   ├── refine
│   │   ├── medium
│   │   └── small
│   ├── summarize
│   │   ├── go
│   │   ├── java
│   │   ├── javascript
│   │   ├── php
│   │   ├── python
│   │   └── ruby
│   └── translate
├── evaluator
│   ├── bleu.py
│   ├── smooth_bleu.py
│   └── CodeBLEU
├── pretrained_models
│   ├── codet5_base
│   └── codet5_small
├── sh
│   ├── exp_with_args.sh
│   ├── run_exp.py
│   ├── results
│   ├── saved_models
│   └── tensorboard
└── tokenizer
    └── salesforce
        ├── codet5-merges.txt
        └── codet5-vocab.json    

Fine-tuning

Go to the sh folder and set WORKDIR in exp_with_args.sh to the path of your downloaded CodeT5 repository.

You can use run_exp.py to run a broad set of experiments by simply passing the model_tag, task, and sub_task arguments. In total, we support four models (i.e., ['roberta', 'codebert', 'codet5_small', 'codet5_base']) and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use the sub_task to specify which specific dataset to fine-tune on.

For example, if you want to run CodeT5-base model on the code summarization task for Ruby, you can simply run:

python run_exp.py --model_tag codet5_base --task summarize --sub_task ruby

Besides, you can specify:

  • model_dir: where to save fine-tuning checkpoints
  • res_dir: where to save the performance results
  • summary_dir: where to save the training curves
  • data_num: how many data instances to use; the default -1 uses the full data
  • gpu: the index of the GPU to use in the cluster

You can also revise the suggested arguments in run_exp.py, and refer to the argument flags in configs.py for the full list of available options. The saved training curves in summary_dir can be visualized with TensorBoard.

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests, or bug reports. We welcome PRs!

Owner
Salesforce
A variety of vendor agnostic projects which power Salesforce
Comments
  • 'tuple' object has no attribute 'loss'

    Hi, I want to run CodeT5-base on the code generation task. I ran the command: python run_exp.py --model_tag codet5_base --task concode --sub_task none

    There is an error: 'tuple' object has no attribute 'loss'.

    I tried changing outputs = model(input_ids=source_ids, attention_mask=source_mask, labels=target_ids, decoder_attention_mask=target_mask) to outputs, _ = model(input_ids=source_ids, attention_mask=source_mask, labels=target_ids, decoder_attention_mask=target_mask)

    Then I get another error: too many values to unpack (expected 2)

    What should I do?
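
    For reference, a hedged note (not from the authors): this error usually means the forward pass returned a plain tuple instead of a ModelOutput, e.g. when return_dict is disabled or with a different transformers version than the one pinned in this repo. A minimal, self-contained sketch of a loss access pattern that works in both cases:

    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

    enc = tokenizer("def greet(): <extra_id_0>", return_tensors="pt")
    labels = tokenizer("print('hello')", return_tensors="pt").input_ids
    outputs = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)

    # read the loss whether outputs is a ModelOutput or a plain tuple
    loss = outputs.loss if hasattr(outputs, "loss") else outputs[0]
    print(float(loss))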

  • Does the released pre-trained model include the dual generation pre-training?

    Dear authors,

    I noticed in the paper that you pre-train the T5 model with identifier-aware denoising for 100 epochs and further pre-train with bimodal generation for 50 epochs. I was wondering whether the released model includes only the first 100 epochs or the whole 150 epochs?

    Thanks in advance for your clarification.

  • Inference for Java code summarization

    Is it possible to do code summarization for raw Java code?

    I can't find an example of inference for code summarization. Could you please provide one? E.g., I would expect code like the following:

    from transformers import RobertaTokenizer,  WHICH_MODELTO_USE
    
    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = WHICH_MODELTO_USE.from_pretrained('Salesforce/codet5-base')
    
    java_code = 'int i = 0; ++i;  int b = runSomeFunction(i); extract(b);'
    code_summarization = model.predict(java_code)
    print(code_summarization)
    

    The expected result is the following: 'Extracts and returns max value'

    Is it possible to make such a prediction? The problem is that I can't understand how you translate the code into the vector used to predict the summary without the pre-training procedures.

    Could you please provide an example?
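
    A minimal, untested sketch of one way this could look, assuming a checkpoint fine-tuned on the summarize task saved as a plain state dict (the filename below is hypothetical); without such fine-tuning, the base pre-trained checkpoint is not expected to produce good summaries:

    import torch
    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
    # hypothetical path: a state dict produced by fine-tuning on the summarize task
    model.load_state_dict(torch.load('saved_models/summarize_java_codet5_base.bin', map_location='cpu'))
    model.eval()

    java_code = "public static int max(int[] xs) { int m = xs[0]; for (int x : xs) { m = Math.max(m, x); } return m; }"
    input_ids = tokenizer(java_code, return_tensors='pt').input_ids
    summary_ids = model.generate(input_ids, max_length=32, num_beams=5)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))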

  • Fine-tuned checkpoints -> Code clone detection

    Hi,

    I am hoping to reproduce the results on the code clone detection task. This might seem like a silly question, but the released fine-tuned checkpoints don't include the RobertaClassificationHead parameters, right? I am able to load only the T5ForConditionalGeneration model using the provided checkpoints for the task.

    So, how do I go about loading the entire CloneModel?
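
    As a small, hedged sketch (the filename below is hypothetical), one can first inspect which parameters a downloaded checkpoint actually contains, e.g. whether any classification-head weights are present, before deciding how to rebuild the full clone model:

    import torch

    # hypothetical filename for a downloaded clone-detection checkpoint
    state_dict = torch.load('finetuned_models/clone_codet5_base.bin', map_location='cpu')
    head_keys = [k for k in state_dict if 'class' in k.lower()]
    print(len(state_dict), 'tensors; classification-head keys:', head_keys)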

  • Can the CodeT5 model do code autocompletion without fine-tuning?

    The README mentions that this is used for code autocompletion in VS Code. I wonder how to use CodeT5 without fine-tuning, as a language model, to complete code given some previous context?
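
    For reference, the public checkpoints are pre-trained with span denoising rather than left-to-right language modeling, so without fine-tuning one crude way to probe completion is to insert a sentinel token where the continuation should go and generate the masked span, as in the quickstart above. A sketch, not a supported completion API:

    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

    # mark the position to complete with a sentinel token and let the model fill the span
    context = "def add(a, b):\n    return <extra_id_0>"
    input_ids = tokenizer(context, return_tensors='pt').input_ids
    out = model.generate(input_ids, max_length=16)
    print(tokenizer.decode(out[0], skip_special_tokens=True))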

  • Regarding Code Generation task.

    Can I use it for code generation? For example, if I give a query like "Add two numbers", it should generate the code for that. If yes, can you please suggest how I can prepare the dataset for this task, or can I use the dataset you mentioned?

    Thank you

  • Pretrained model for prediction

    Can you kindly elaborate on how we can use the fine-tuned checkpoints for prediction on new data in the concode task? Say this is my prediction data: {"code": "public integer sum(Integer arg0,Integer arg1) {return result;}", "nl": "Add two integers. concode_field_sep int sum concode_field_sep int result"} If I understand correctly, concode is supposed to complete these functions. However, I am not sure how to generate a prediction on this sample data. I tried replacing the test file containing the original test data with this sample test data and then ran the command python run_exp.py --model_tag codet5_small --task concode --sub_task none. This command starts with training, then evaluating, and finally testing. However, I am interested only in prediction. Isn't there any way to directly generate predictions from a fine-tuned model on concode? Kindly let me know if I am doing something wrong.

  • About the AI coding assistant demo

    Hi, the newly added AI coding assistant demo is cool! I have a few questions about it:

    1. Did you make the CodeT5 model into a VS Code plugin? How did you do that?

    2. When demonstrating code generation, editing the comment generates the corresponding code snippet. Aren't the inputs to the code generation model a natural language description and the class environment?

    The format of the data in the Concode dataset is:

    {
        "code": "int function ( double [ ] arg0 , double [ ] arg1 ) { int loc0 = arg0 . length - arg1 . length ; outer : for ( int loc1 = 0 ; loc1 <= loc0 ; loc1 ++ ) { for ( int loc2 = 0 ; loc2 < arg1 . length ; loc2 ++ ) { if ( ne ( arg0 [ loc1 + loc2 ] , arg1 [ loc2 ] ) ) { continue outer ; } } return ( loc1 ) ; } return ( - 1 ) ; }",
        "nl": "searches for the first subsequence of a that matches sub elementwise . elements of sub are considered to match elements of a if they pass the #eq test . concode_field_sep double max_ratio concode_elem_sep double min_ratio concode_elem_sep boolean off concode_field_sep boolean isElemMatch concode_elem_sep int compare concode_elem_sep boolean isSubset concode_elem_sep boolean ne concode_elem_sep boolean lt concode_elem_sep boolean gte concode_elem_sep void set_rel_diff concode_elem_sep boolean eq concode_elem_sep boolean lte concode_elem_sep boolean gt"
    }
    

    Is it possible to generate an accurate code snippet by typing only comments, without the class environment? Doesn't the loss of context information affect the quality of the generated code?

  • Can we print a model summary of CodeT5 model?

    I want to know how to print a model summary of the CodeT5-base model that is hosted on the Hugging Face hub.

    • https://huggingface.co/Salesforce/codet5-base
    • https://arxiv.org/abs/2109.00859

    We could print the model summary with the torchinfo and torchsummary modules, as in the AlexNet case below. How can we print the model summary in the case of the CodeT5-base model? (See the sketch at the end of this comment.)

    • Case: Alexnet
    from torchsummary import summary
    help(summary)
    import torchvision.models as models
    alexnet = models.alexnet(pretrained=False)
    alexnet.cuda()
    summary(alexnet, (3, 224, 224))
    print(alexnet)
    
    • Case: Convnet
    from torchinfo import summary
    model = ConvNet()
    batch_size = 16
    summary(model, input_size=(batch_size, 1, 28, 28))
    
    • Case: BERT
    import torch
    from torchvision import models
    from torchsummary import summary
    dt = 2020
    torch.manual_seed(dt)
    torch.backends.cudnn.deterministic = True
    from transformers import BertTokenizer
    pretrainedmodel_vgg = models.vgg16()
    BT = BertTokenizer.from_pretrained('bert-base-uncased')
    len(BT)
    bertresult = BT.tokenize('Hi!! Welcome To The PythonGuides')
    print(bertresult)
    summary(pretrainedmodel_vgg, (3, 224, 224))
    
    • Case Lightning (encoder/decoder):
    import os
    import torch
    from torch import nn
    import torch.nn.functional as func
    from torchvision.datasets import MNIST
    from torch.utils.data import DataLoader, random_split
    from torchvision import transforms
    import pytorch_lightning as pylig
    class litauto_encoder(pylig.LightningModule):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
            self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))
    
        def forward(self, m):
            embding = self.encoder(m)
            return embding
    
        def training_step(self, btch, btch_indx):
            m, n = btch
            m = m.view(m.size(0), -1)
            o = self.encoder(m)
            m_hat = self.decoder(o)
            losses = func.mse_loss(m_hat, m)
            self.log("train_loss", losses)
            return losses
    
        def configure_optimizers(self):
            optim = torch.optim.Adam(self.parameters(), lr=1e-3)
            return optim
    dt = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
    trained, valid = random_split(dt, [55000, 5000])
    
    autoencoder = litauto_encoder()
    traine = pylig.Trainer()
    traine.fit(autoencoder, DataLoader(trained), DataLoader(valid))
    summary(autoencoder, (1, 28, 28))
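
    A minimal sketch of a simpler route that avoids torchinfo/torchsummary (which require specifying an example input shape, awkward for a seq2seq text model): load the model with transformers and print the module tree and parameter count directly. This is just one convenient option, not an official recipe:

    from transformers import T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
    print(model)  # prints the full module tree (embeddings, encoder/decoder blocks, lm_head)
    total_params = sum(p.numel() for p in model.parameters())
    print(f'total parameters: {total_params:,}')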
    
  • Code for normalizing variables

    Hi,

    I am trying to reproduce the refine task on my data. I see that the dataset for refine has abstracted the types and variables, e.g.

    private void METHOD_1 ( java.lang.Class VAR_1 )...
    

    Is the code to do this provided in utils.py? If not, how should I go about it?

  • Beam search for generation task

    Dear authors,

    I'd like to know what I should do if I wish to output a certain number of top results (e.g., top-10) in the code generation task.

    Currently, it seems that only the top-1 result is returned here: https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L105
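
    For reference, a hedged sketch using the standard Hugging Face generation API (not the run_gen.py code path in this repo): passing num_return_sequences together with num_beams returns several beams instead of only the best one.

    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

    input_ids = tokenizer("def greet(user): print(f'hello <extra_id_0>!')", return_tensors='pt').input_ids
    # return the 10 highest-scoring beams instead of only the top one
    outputs = model.generate(input_ids, max_length=8, num_beams=10, num_return_sequences=10)
    for ids in outputs:
        print(tokenizer.decode(ids, skip_special_tokens=True))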

  • Task Control Codes

    I was using the Hugging Face codet5-base to try code generation and understanding tasks.

    I cannot find documentation anywhere that indicates how to use "task control codes" for different input types.

    I am trying to do something like Figure 1 in the original paper.

  • Reproducing translation results using the released finetuned checkpoint

    Hi! I evaluated your fine-tuned checkpoint on java-cs translation but could not get exactly the same results as your paper reported. I got 83.89/64.7 but the paper reported 84.03/65.9. I read that you use beam search without sampling to generate the results, which should not introduce randomness, so I'm wondering where the randomness came from.

    This is my output: [image]

    I downloaded the checkpoint from here (and I used translate_java_cs_codet5_base.bin): [image]

    Thank you!

  • Generation task (SysML)

    I want to create my own dataset for the generation task. I want to convert text to SysML code. SysML code examples are here (https://github.com/Systems-Modeling/SysML-v2-Release/tree/master/sysml/src/examples).

    In the data/concode folder I want to provide my own dev, test, and train .json files. But will I be able to generate SysML code? Is the current CodeT5 compatible with my task? I am starting to research this code. Any suggestions or ideas are appreciated. Thanks.

  • How to get embeddings for JavaScript and Python code snippets?

    I have a couple of questions:

    a) How can I use CodeT5 to extract embeddings for JavaScript and Python code?
    b) Can I feed incomplete JavaScript and Python snippets to extract embeddings, or does the code snippet need to be complete?
    c) Has anyone used CodeT5 to perform code-to-code search?
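
    On (a) and (b), one possible approach (a sketch, not an official embedding API) is to run only the encoder and pool its hidden states; incomplete snippets can be fed the same way, since the tokenizer does not require syntactically complete code. Whether such embeddings work well for code-to-code search (c) is an open question.

    import torch
    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

    code = "function add(a, b) { return a + b; }"
    inputs = tokenizer(code, return_tensors='pt')
    with torch.no_grad():
        encoder_out = model.encoder(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)
    # mean-pool the encoder hidden states as one possible fixed-size embedding
    embedding = encoder_out.last_hidden_state.mean(dim=1)
    print(embedding.shape)  # torch.Size([1, 768]) for codet5-base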
