Data Augmentation using Pre-trained Transformer Models

Code associated with the paper "Data Augmentation using Pre-trained Transformer Models" (Kumar et al., 2020).

The code contains implementations of the following data augmentation methods (the label-prepend idea behind the last three is sketched after the list):

  • EDA (Baseline)
  • Backtranslation (Baseline)
  • CBERT (Baseline)
  • BERT Prepend (Our paper)
  • GPT-2 Prepend (Our paper)
  • BART Prepend (Our paper)
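
BERT Prepend, GPT-2 Prepend, and BART Prepend all rely on the same conditioning idea: the class label is prepended to each training example before fine-tuning, and again at generation time, so the pre-trained model produces label-consistent synthetic examples. Below is a minimal sketch of that idea for a GPT-2 style model using the transformers library; the prompt format, separator, and sampling settings are illustrative assumptions, not the exact configuration used in this repository.

# Minimal sketch of label-prepend generation with GPT-2 (illustrative only;
# prompt format and sampling settings are assumptions, not this repo's config).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # in practice, fine-tuned on label-prepended training data
model.eval()

def augment(label, seed_text, num_samples=3):
    # Condition generation on "<label> <sep> <seed words>" so samples stay label-consistent.
    prompt = f"{label} {tokenizer.eos_token} {seed_text}"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    outputs = model.generate(
        input_ids,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        max_length=input_ids.shape[1] + 30,
        num_return_sequences=num_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt tokens and return only the generated continuations.
    return [tokenizer.decode(o[input_ids.shape[1]:], skip_special_tokens=True) for o in outputs]

print(augment("PlayMusic", "play the latest"))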

Datasets

In the paper, we use three datasets: STSA-2, TREC, and SNIPS.

Low-data regime experiment setup

Run the src/utils/download_and_prepare_datasets.sh script to prepare all datasets. It performs the following steps:

  1. Downloads the data from GitHub.
  2. Replaces numeric labels with text labels for the STSA-2 and TREC datasets.
  3. Creates 15 random splits of train and dev data for a given dataset (a rough sketch of this step follows).
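
For reference, the split-creation step amounts to shuffling the labeled training pool with a different seed for each split and carving out small train and dev subsets. The sketch below shows one way this could be done; the function name, split sizes, and output file layout are assumptions for illustration, not the actual parameters of download_and_prepare_datasets.sh.

# Rough sketch of creating multiple random train/dev splits (split sizes and
# file names are illustrative assumptions, not the script's actual settings).
import csv
import os
import random

def make_splits(examples, out_dir, num_splits=15, train_size=10, dev_size=10):
    os.makedirs(out_dir, exist_ok=True)
    for i in range(num_splits):
        rng = random.Random(i)  # one seed per split, for reproducibility
        shuffled = examples[:]
        rng.shuffle(shuffled)
        splits = {"train": shuffled[:train_size],
                  "dev": shuffled[train_size:train_size + dev_size]}
        for name, rows in splits.items():
            with open(os.path.join(out_dir, f"{name}_{i}.tsv"), "w", newline="") as f:
                writer = csv.writer(f, delimiter="\t")
                writer.writerows(rows)  # each row: (label, text)

# Example: make_splits([("PlayMusic", "play some jazz"), ...], "datasets/snips/low_data")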

Dependencies

To run this code, you need the following dependencies:

  • PyTorch 1.5
  • fairseq 0.9
  • transformers 2.9
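
If you want to confirm that the pinned versions are picked up in your environment (assuming all three packages are installed), a quick check is:

# Sanity-check installed versions against the pinned dependencies above.
import torch, fairseq, transformers

print("torch:", torch.__version__)                # expect 1.5.x
print("fairseq:", fairseq.__version__)            # expect 0.9.x
print("transformers:", transformers.__version__)  # expect 2.9.x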

How to run

To run a data augmentation experiment for a given dataset, run the corresponding bash script in the scripts folder. For example, to run data augmentation on the SNIPS dataset:

  • run scripts/bart_snips_lower.sh for the BART experiment
  • run scripts/bert_snips_lower.sh for the rest of the data augmentation methods

How to cite

@inproceedings{kumar-etal-2020-data,
    title = "Data Augmentation using Pre-trained Transformer Models",
    author = "Kumar, Varun  and
      Choudhary, Ashutosh  and
      Cho, Eunah",
    booktitle = "Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.lifelongnlp-1.3",
    pages = "18--26",
}

Contact

Please reach out to [email protected] with any questions related to this code.

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 license.

Comments
  • Where to find BART Data Augmentation Data

    Hello,

    I have been able to run the bart_snips_lower.sh script without error. I do get a warning:

    UserWarning: This overload of nonzero is deprecated: nonzero()

    I think this is related to a version mismatch between my PyTorch 1.6 and Python 3.8 installation. However, despite that being the only warning printed to my terminal console, running the model produces empty text files: bart_l5_gen_3.tsv

    These .tsv files seem to be where the augmented data examples should be; but because they're empty files, I'm unsure if I'm looking in the right place.

    Could you provide direction on the following questions:

    1. Are the augmented data files saved in the following location $MODELDIR/bart_l5_gen_${PREFIXSIZE}.tsv ?
    2. Should these files be empty?
    3. What Python version are you using in your project?
    4. What cuda toolkit version are you using in your project?
    5. Are you using torch 1.6 or torch 1.5?

    Thanks!
