🤗 Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.




State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, and text generation in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets, and then share them with the community on our model hub. At the same time, each Python module defining an architecture can be used as a standalone and modified to enable quick research experiments.

🤗 Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with seamless integration between them, allowing you to train your models with one and then load them for inference with the other.

Online demos

You can test most of our models directly on their pages from the model hub. We also offer private model hosting, versioning, & an inference API to use those models.

Here are a few examples:

Write With Transformer, built by the Hugging Face team, is the official demo of this repo's text generation capabilities.

Quick tour

To immediately use a model on a given text, we provide the pipeline API. Pipelines group together a pretrained model with the preprocessing that was used during that model's training. Here is how to quickly use a pipeline to classify positive versus negative texts:

>>> from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
>>> classifier = pipeline('sentiment-analysis')
>>> classifier('We are very happy to include pipeline into the transformers repository.')
[{'label': 'POSITIVE', 'score': 0.9978193640708923}]

The second line of code downloads and caches the pretrained model used by the pipeline, while the third evaluates it on the given text. Here the answer is "positive" with a confidence of 99.8%.

Here is another example, this time using a pipeline to extract the answer to a question from some context:

>>> from transformers import pipeline

# Allocate a pipeline for question-answering
>>> question_answerer = pipeline('question-answering')
>>> question_answerer({
...     'question': 'What is the name of the repository ?',
...     'context': 'Pipeline have been included in the huggingface/transformers repository'
... })
{'score': 0.5135612454720828, 'start': 35, 'end': 59, 'answer': 'huggingface/transformers'}

On top of the answer, the pretrained model used here returned its confidence score, along with the start and end positions of the answer in the tokenized sentence. You can learn more about the tasks supported by the pipeline API in this tutorial.
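To pin a specific checkpoint instead of relying on the task default, you can also pass a model identifier from the model hub to the pipeline. A minimal sketch (distilbert-base-uncased-finetuned-sst-2-english is one publicly available sentiment checkpoint):

>>> from transformers import pipeline

>>> classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
>>> classifier('We are very happy to include pipeline into the transformers repository.')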

To download and use any of the pretrained models on your given task, you just need to use these three lines of code (PyTorch version):

>>> from transformers import AutoTokenizer, AutoModel

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> model = AutoModel.from_pretrained("bert-base-uncased")

>>> inputs = tokenizer("Hello world!", return_tensors="pt")
>>> outputs = model(**inputs)

or for TensorFlow:

>>> from transformers import AutoTokenizer, TFAutoModel

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> model = TFAutoModel.from_pretrained("bert-base-uncased")

>>> inputs = tokenizer("Hello world!", return_tensors="tf")
>>> outputs = model(**inputs)

The tokenizer is responsible for all the preprocessing the pretrained model expects and can be called directly on a single text (or a list of texts), as we can see on the fourth line of both code examples. It will output a dictionary that you can directly pass to your model (which is done on the fifth line).
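For example, inspecting that dictionary shows exactly which tensors the model expects. A small sketch (the keys depend on the tokenizer; these are the ones bert-base-uncased produces):

>>> inputs = tokenizer("Hello world!", return_tensors="pt")
>>> list(inputs.keys())
['input_ids', 'token_type_ids', 'attention_mask']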

The model itself is a regular PyTorch nn.Module or a TensorFlow tf.keras.Model (depending on your backend), which you can use normally. For instance, this tutorial explains how to integrate such a model in a classic PyTorch or TensorFlow training loop, or how to use our Trainer API to quickly fine-tune it on a new dataset.
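As a rough illustration of the Trainer path, here is a minimal sketch (not a complete recipe: my_train_dataset is a placeholder for a tokenized dataset you prepare yourself, and the hyperparameters are arbitrary):

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
training_args = TrainingArguments(output_dir="my_output_dir", num_train_epochs=1, per_device_train_batch_size=8)

# my_train_dataset: your own tokenized dataset (placeholder)
trainer = Trainer(model=model, args=training_args, train_dataset=my_train_dataset)
trainer.train()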

Why should I use transformers?

  1. Easy-to-use state-of-the-art models:

    • High performance on NLU and NLG tasks.
    • Low barrier to entry for educators and practitioners.
    • Few user-facing abstractions with just three classes to learn.
    • A unified API for using all our pretrained models.
  2. Lower compute costs, smaller carbon footprint:

    • Researchers can share trained models instead of always retraining.
    • Practitioners can reduce compute time and production costs.
    • Dozens of architectures with over 2,000 pretrained models, some in more than 100 languages.
  3. Choose the right framework for every part of a model's lifetime:

    • Train state-of-the-art models in 3 lines of code.
    • Move a single model between TF2.0/PyTorch frameworks at will.
    • Seamlessly pick the right framework for training, evaluation, production.
  4. Easily customize a model or an example to your needs:

    • Examples for each architecture to reproduce the results by the official authors of said architecture.
    • Expose the models' internals as consistently as possible.
    • Model files can be used independently of the library for quick experiments.

Why shouldn't I use transformers?

  • This library is not a modular toolbox of building blocks for neural nets. The code in the model files is deliberately not refactored with additional abstractions, so that researchers can quickly iterate on each of the models without diving into additional abstractions/files.
  • The training API is not intended to work on any model but is optimized to work with the models provided by the library. For generic machine learning loops, you should use another library.
  • While we strive to present as many use cases as possible, the scripts in our examples folder are just that: examples. It is expected that they won't work out-of-the-box on your specific problem and that you will need to change a few lines of code to adapt them to your needs.

Installation

With pip

This repository is tested on Python 3.6+, PyTorch 1.0.0+ (PyTorch 1.3.1+ for examples) and TensorFlow 2.0.

You should install 🤗 Transformers in a virtual environment. If you're unfamiliar with Python virtual environments, check out the user guide.

First, create a virtual environment with the version of Python you're going to use and activate it.

Then, you will need to install at least one of TensorFlow 2.0, PyTorch, or Flax. Please refer to the TensorFlow installation page, the PyTorch installation page, and/or the Flax installation page for the specific install command for your platform.

When TensorFlow 2.0 and/or PyTorch has been installed, 🤗 Transformers can be installed using pip as follows:

pip install transformers

If you'd like to play with the examples or need the bleeding edge of the code and can't wait for a new release, you must install the library from source.
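For example, a typical command for that (assuming you want the current development branch from GitHub) is:

pip install git+https://github.com/huggingface/transformers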

With conda

Since Transformers version v4.0.0, we now have a conda channel: huggingface.

🤗 Transformers can be installed using conda as follows:

conda install -c huggingface transformers

Follow the installation pages of TensorFlow, PyTorch or Flax to see how to install them with conda.

Models architectures

All the model checkpoints provided by 🤗 Transformers are seamlessly integrated from the huggingface.co model hub, where they are uploaded directly by users and organizations.


🤗 Transformers currently provides the following architectures (see here for a high-level summary of each of them):

  1. ALBERT (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
  2. BART (from Facebook) released with the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
  3. BARThez (from École polytechnique) released with the paper BARThez: a Skilled Pretrained French Sequence-to-Sequence Model by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
  4. BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  5. BERT For Sequence Generation (from Google) released with the paper Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
  6. Blenderbot (from Facebook) released with the paper Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
  7. BlenderbotSmall (from Facebook) released with the paper Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
  8. BORT (from Alexa) released with the paper Optimal Subarchitecture Extraction For BERT by Adrian de Wynter and Daniel J. Perry.
  9. CamemBERT (from Inria/Facebook/Sorbonne) released with the paper CamemBERT: a Tasty French Language Model by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
  10. ConvBERT (from YituTech) released with the paper ConvBERT: Improving BERT with Span-based Dynamic Convolution by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
  11. CTRL (from Salesforce) released with the paper CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
  12. DeBERTa (from Microsoft Research) released with the paper DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
  13. DialoGPT (from Microsoft Research) released with the paper DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
  14. DistilBERT (from HuggingFace), released together with the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into DistilGPT2, RoBERTa into DistilRoBERTa, Multilingual BERT into DistilmBERT and a German version of DistilBERT.
  15. DPR (from Facebook) released with the paper Dense Passage Retrieval for Open-Domain Question Answering by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
  16. ELECTRA (from Google Research/Stanford University) released with the paper ELECTRA: Pre-training text encoders as discriminators rather than generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
  17. FlauBERT (from CNRS) released with the paper FlauBERT: Unsupervised Language Model Pre-training for French by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
  18. Funnel Transformer (from CMU/Google Brain) released with the paper Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
  19. GPT (from OpenAI) released with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
  20. GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
  21. LayoutLM (from Microsoft Research Asia) released with the paper LayoutLM: Pre-training of Text and Layout for Document Image Understanding by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
  22. LED (from AllenAI) released with the paper Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.
  23. Longformer (from AllenAI) released with the paper Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.
  24. LXMERT (from UNC Chapel Hill) released with the paper LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering by Hao Tan and Mohit Bansal.
  25. MarianMT Machine translation models trained using OPUS data by Jörg Tiedemann. The Marian Framework is being developed by the Microsoft Translator Team.
  26. MBart (from Facebook) released with the paper Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
  27. MPNet (from Microsoft Research) released with the paper MPNet: Masked and Permuted Pre-training for Language Understanding by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
  28. MT5 (from Google AI) released with the paper mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
  29. Pegasus (from Google) released with the paper PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
  30. ProphetNet (from Microsoft Research) released with the paper ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
  31. Reformer (from Google Research) released with the paper Reformer: The Efficient Transformer by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
  32. RoBERTa (from Facebook), released together with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
  33. SqueezeBert released with the paper SqueezeBERT: What can computer vision teach NLP about efficient neural networks? by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
  34. T5 (from Google AI) released with the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
  35. TAPAS (from Google AI) released with the paper TAPAS: Weakly Supervised Table Parsing via Pre-training by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
  36. Transformer-XL (from Google/CMU) released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
  37. Wav2Vec2 (from Facebook AI) released with the paper wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
  38. XLM (from Facebook) released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.
  39. XLM-ProphetNet (from Microsoft Research) released with the paper ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
  40. XLM-RoBERTa (from Facebook AI), released together with the paper Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
  41. XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
  42. Want to contribute a new model? We have added a detailed guide and templates to guide you through the process of adding a new model. You can find them in the templates folder of the repository. Be sure to check the contributing guidelines and contact the maintainers or open an issue to collect feedback before starting your PR.

To check whether each model has an implementation in PyTorch/TensorFlow/Flax, or has an associated tokenizer backed by the 🤗 Tokenizers library, refer to this table.

These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations. You can find more details on performance in the Examples section of the documentation.

Learn more

Section | Description
Documentation | Full API documentation and tutorials
Task summary | Tasks supported by 🤗 Transformers
Preprocessing tutorial | Using the Tokenizer class to prepare data for the models
Training and fine-tuning | Using the models provided by 🤗 Transformers in a PyTorch/TensorFlow training loop and the Trainer API
Quick tour: Fine-tuning/usage scripts | Example scripts for fine-tuning models on a wide range of tasks
Model sharing and uploading | Upload and share your fine-tuned models with the community
Migration | Migrate to 🤗 Transformers from pytorch-transformers or pytorch-pretrained-bert

Citation

We now have a paper you can cite for the 🤗 Transformers library:

@inproceedings{wolf-etal-2020-transformers,
    title = "Transformers: State-of-the-Art Natural Language Processing",
    author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rรฉmi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = oct,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
    pages = "38--45"
}
Comments
  • How to use BERT for finding similar sentences or similar news?


    I have used BERT's NextSentencePredictor to find similar sentences or similar news, but it's super slow, even on a Tesla V100, which is the fastest GPU so far. It takes around 10 seconds for a query title against around 3,000 articles. Is there a way to use BERT better for finding similar sentences or similar news given a corpus of news articles?
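    For what it's worth, a common faster alternative to pairwise NextSentencePrediction is to pre-compute one embedding per article (for example by mean-pooling BertModel hidden states) and compare a query embedding against all of them with cosine similarity. A minimal sketch of that approach (the pooling choice and checkpoint are assumptions, not a tuned solution):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased").eval()

    def embed(texts):
        # Tokenize a batch of texts and mean-pool the final hidden states over real tokens.
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state           # (batch, seq_len, hidden)
        mask = enc["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, hidden)

    query_emb = embed(["some query title"])                     # placeholder query
    corpus_emb = embed(["article title 1", "article title 2"])  # pre-compute once for the whole corpus
    scores = torch.nn.functional.cosine_similarity(query_emb, corpus_emb)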

  • Summarization Fine Tuning


    โ“ Questions & Help

    Details

    I tried using T5 and BART, but abstractive summarization on scientific texts does not seem to give the results I want, since I think they are both trained on news corpora. I have scraped all of the free PMC articles and I am thinking about fine-tuning a seq2seq model between the articles and their abstracts to make an abstractive summarizer for scientific texts. This Medium article (https://medium.com/huggingface/encoder-decoders-in-transformers-a-hybrid-pre-trained-architecture-for-seq2seq-af4d7bf14bb8) provides a bit of an introduction to how to approach this, but it does not quite go into detail, so I am wondering how to proceed.

    I'm not really asking for help being stuck but I just don't really know how to approach this problem.

    A link to original question on Stack Overflow: https://stackoverflow.com/questions/61826443/train-custom-seq2seq-transformers-model
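    If it helps to see the shape of that setup, here is a minimal sketch of one training step on a single article/abstract pair with BART (the checkpoint, max lengths, and learning rate are assumptions, and real fine-tuning would batch the data and loop over it, e.g. with the Trainer/Seq2SeqTrainer):

    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

    article = "full text of one PMC article ..."  # placeholder
    abstract = "its abstract ..."                 # placeholder

    inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")
    labels = tokenizer(abstract, max_length=256, truncation=True, return_tensors="pt").input_ids

    outputs = model(**inputs, labels=labels)  # the model builds decoder inputs from the labels internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()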

  • GPT-J-6B


    What does this PR do?

    Introduces the long-awaited GPT-J model class to HuggingFace! Concurrently with this PR being merged I will make a GPT-J 6B checkpoint public on the EleutherAI HF page for people to use. The model has been evaluated as being within error tolerances of the GPT-J 6B model we released in Jax two months ago.

    @patil-suraj was very helpful in assisting me to understand HF philosophy and how to make this PR most in line with the rest of the codebase. Other than that, the major design consideration was to make the configs compatible with GPT-2 rather than GPT-Neo. GPT-Neo has some usability limitations due to its configs having names unrelated to GPT-2's (see #12183 for details). Given those problems and my hope that GPT-Neo will have its configs updated in the future, it seemed like a clear choice to align GPT-J with GPT-2.

    Shout-outs to @finetuneanon, whose implementation this one is based on, as well as @kumuruz for assistance optimizing and debugging.

    Supersedes #12243 #13010 #13022

    Closes #12098

    Before submitting

    • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
    • [X] Did you read the contributor guideline, Pull Request section?
    • [X] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. It was discussed in Slack with @patil-suraj
    • [X] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
    • [X] Did you write any new necessary tests?

    Who can review?

    • gpt2: @patrickvonplaten, @LysandreJik, @patil-suraj
  • [DeepSpeed] [success] trained t5-11b on 1x 40GB gpu


    Managed to train t5-11b on 1x 40GB gpu w/ Deepspeed (A100-SXM4-40GB)

    Thank you, @PeterAJansen for letting me use your hardware!

    Thank you, @jeffra and @samyam, for never believing it was impossible to train t5-11b on 1x 40GB gpu w/ Deepspeed, and for the support that led me to find a few bugs in the integration.

    Sharing details for those who need.

    If you want to try this at home, please make sure you use transformers master, as some bug fixes were just merged in.

    Well, it's similar to the t5-3b on 24GB success reported here and here. But this time t5-11b on 1x 40GB gpu (or 4x if you wanted things faster)

    As someone asked me before you need a huge amount of general RAM to use ZeRO-Offload for a huge model:

    • for t5-3b on 1x 24GB gpu: ~71GB RAM
    • for t5-11b on 1x 40GB gpu: ~234GB RAM

    I was using the /usr/bin/time -v program to get the peak memory measurement - it's the Maximum resident set size entry in the final report.

    Question: I don't think /usr/bin/time does the right thing for multi-process - I think it only measures the parent process. e.g. with 4x gpus it reported only 102GB RAM, but I clearly saw in top that it was around 240GB. If you have an easy way to measure peak memory that takes into account forked processes, I'm all ears.

    Batch sizes on one gpu:

    • with buffers of 5e8 I was able to run BS=2, which might be too small for training,
    • but with 2e8 I managed to squeeze in BS=10 for training, but OOMed on prediction

    I'm referring to these batch sizes in ds_config.json:

            "allgather_bucket_size": 2e8,
            "reduce_bucket_size": 2e8,
    

    And I tested for 2x and 4x DDP as well, BS=16 OOMed, BS=8 was good so I used that - but could probably squeeze some more.

    edit1: later tests show that my test was too short and wasn't letting the CPU Adam optimizer kick in, as it skips the first 20 or so steps because of the overflow. So once it kicks in it takes more GPU memory, so the practical BS is much smaller - I think around 2 on this setup. So most likely you will need to use BS=2 for real work, until things get optimized even more.

    edit2: things are getting re-shuffled in the tests, so the default ds_config.json file has moved in master to a new, hopefully permanent home. It's now at examples/tests/deepspeed/ds_config.json, so you will need to adjust the command line to reflect this new location or simply copy it over to where the old one used to be.

    here is the full benchmark:

    # 1 gpu: 
    # only training fits with this BS, eval needs a smaller BS
    
    export BS=8; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=1 ./finetune_trainer.py --model_name_or_path t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
    
    {'train_runtime': 31.0897, 'train_samples_per_second': 0.257, 'epoch': 1.0}
    
    # 2 gpus:
    
    export BS=8; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=2 ./finetune_trainer.py --model_name_or_path t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
    
    {'train_runtime': 17.9026, 'train_samples_per_second': 0.223, 'epoch': 1.0}
    
    # 4 gpus
    
    export BS=8; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
    
    {'train_runtime': 10.4404, 'train_samples_per_second': 0.192, 'epoch': 1.0}
    

    Checkpointing should allow making even bigger batch sizes.

  • FP16 overflow with GPT-Neo when using sequence lengths of 2048.


    Environment info

    • transformers version: 4.5.0.dev0
    • Platform: Linux-5.4.0-54-generic-x86_64-with-glibc2.29
    • Python version: 3.8.5
    • PyTorch version (GPU?): 1.8.0+cu111
    • Tensorflow version (GPU?): N/A
    • Using GPU in script?: Yes
    • Using distributed or parallel set-up in script?: No

    Who can help

    @stas00

    Models:

    • GPT-Neo 1.3b

    Library:

    • deepspeed: @stas00

    Information

    Model I am using (Bert, XLNet ...):

    The problem arises when using:

    • [ ] the official example scripts: (give details below)
    • [x] my own modified scripts: (give details below)

    The tasks I am working on is:

    • [ ] an official GLUE/SQUaD task: (give the name)
    • [x] my own task or dataset: (give details below)

    To reproduce

    Steps to reproduce the behavior:

    1. Use GPT-Neo 1.3b with The Pile dataset and the built-in trainer. Artificial data also suffices. It does not matter what the data is, as long as the attention mask spans all 2048 tokens.
    2. Enable FP16 and set max_length to 2048
    3. Observe that all losses reported are NaN

    Also reproducible using AMP or DeepSpeed. It seems like there is code to circumvent this outlined in the GPT-Neo implementation, where q, k, v are cast to fp32 in the attention block.

    When the max_length is shorter (512) this overflow does not occur.

    Expected behavior

    I expected no overflows.

    Aside

    I'm reaching out on behalf of EleutherAI, Lysandre told us to create an issue about this.

  • How to use fine-tuned BART for prediction?


    โ“ Questions & Help

    Details

    I fine-tuned the BART model on a custom summarization dataset using the transformers/examples/summarization/bart/finetune.py and transformers/examples/summarization/bart/run_train.sh files in the repository for training (which generated three checkpointepoch=*.ckpt files) and prediction (which generated a .txt file with the test loss scores).

    I have two questions on using this model for prediction:

    • How can I modify finetune.py to generate predictions for the test set, in addition to the loss scores? I see some test functions in finetune.py, but I'm not sure how to use these for generating a .txt file with the predictions.

    • How can I load the generated .ckpt files into BartForConditionalGeneration()? A config.json file was not generated along with the checkpoint files; there doesn't seem to be a TFBartForConditionalGeneration; and the convert_tf_checkpoint_to_pytorch.py script in the repo doesn't seem to support BART yet.

    Thank you for your time!

  • Add TF ViT MAE


    This PR adds the MAE [1] model in TensorFlow. It was developed by @arig23498 and myself.

    Fun facts about this PR:

    • Probably the third pure vision model in TensorFlow in transformers.

    References:

    [1] Masked Autoencoders Are Scalable Vision Learners

    Update

    The PR is now ready for review. @gante @Rocketknight1 @sgugger

  • Add TFConvNextModel


    This PR adds the ConvNeXt [1] model in TensorFlow. It was developed by @arig23498, @gante, and myself.

    Fun facts about this PR:

    • Probably the first pure conv model in transformers.
    • Probably the second pure vision model in TensorFlow in transformers.

    References:

    [1] A ConvNet for the 2020s: https://arxiv.org/abs/2201.03545.

    @gante @LysandreJik @Rocketknight1

  • Pegasus finetuning: OOM


    Epoch 0: 91% 5747/6331 [39:52<04:03, 2.40it/s, loss=75.765, v_num=2]/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler. warnings.warn(SAVE_STATE_WARNING, UserWarning) tcmalloc: large alloc 1083260928 bytes == 0x1aece0000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 1354080256 bytes == 0x21e5c000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 1692606464 bytes == 0x7f10651ce000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 2115764224 bytes == 0x7f0fe700e000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 2644705280 bytes == 0x7f0f495de000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 3305881600 bytes == 0x7f0fe700e000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 4132356096 bytes == 0x7f0e530f2000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 tcmalloc: large alloc 5165449216 bytes == 0x7f0f495de000 @ 0x7f144f09c615 0x591f47 0x4cc229 0x4cc38b 0x566c91 0x5a4df1 0x630b1d 0x7f1443355950 0x7f1443359bf7 0x7f144368a7e8 0x7f14436401b3 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 ./finetune_pegasus_xsum.sh: line 15: 876 Killed

    I appreciate any help. Thank you.

  • Feature extraction for sequential labelling


    Hi, I have a question in terms of using BERT for sequential labeling task. Please correct me if I'm wrong. My understanding is:

    1. Use BertModel loaded with pretrained weights instead of MaskedBertModel.
    2. In that case, given a sequence of tokens as input, BertModel outputs a list of hidden states, and I only use the top-layer hidden states as the embedding for that sequence.
    3. Then, to fine-tune the model, add a linear fully connected layer and a softmax to make the final decision.

    Is this entire process correct? I followed this procedure but could not get any results.

    Thank you!
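    For illustration, here is a minimal sketch of the setup described in this question: a linear layer over the top BertModel hidden states, trained with a cross-entropy loss (num_labels, the example sentence, and the label ids are placeholders you would supply yourself). Note that the library also provides BertForTokenClassification, which wires this up for you.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    num_labels = 9  # placeholder: the number of tags in your labelling scheme
    classifier = torch.nn.Linear(bert.config.hidden_size, num_labels)

    enc = tokenizer(["HuggingFace is based in New York City"], return_tensors="pt")
    hidden = bert(**enc).last_hidden_state   # (batch, seq_len, hidden): top-layer hidden states
    logits = classifier(hidden)              # (batch, seq_len, num_labels)

    # During fine-tuning, compare the logits against per-token label ids (placeholder tensor):
    # loss = torch.nn.functional.cross_entropy(logits.view(-1, num_labels), label_ids.view(-1))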

  • Sharded DDP training fails with seq2seq models


    Information

    Model I am using (Bert, XLNet ...): T5/BART/mBART/Marian

    The problem arises when using:

    • [x] the official example scripts: (give details below)
    • [ ] my own modified scripts: (give details below)

    The tasks I am working on is:

    • [x] an official GLUE/SQUaD task: seq2seq
    • [ ] my own task or dataset: (give details below)

    To reproduce

    Steps to reproduce the behavior:

    Run

    python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/finetune_trainer.py \
    --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir \
    ~/Downloads/wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
    --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 \
    --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 4 --sortish_sampler \
    --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 \
    --n_train 500 --sharded_ddp
    

    will fail with

    Traceback (most recent call last):
    File "examples/seq2seq/finetune_trainer.py", line 379, in <module>
    main()
    File "examples/seq2seq/finetune_trainer.py", line 316, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
    File "/home/sgugger/git/transformers/src/transformers/trainer.py", line 821, in train
    self.optimizer.step()
    File "/home/sgugger/.pyenv/versions/base/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
    File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 210, in step
    self._broadcast_params()
    File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 522, in _broadcast_params
    if self.should_bucket_param[param]:
    KeyError: Parameter containing:
    tensor([[-0.0296,  0.0038],
    [ 0.0000,  0.0000],
    [ 0.0298,  0.0385],
    ...,
    [-0.0161, -0.0024],
    [ 0.0022, -0.0576],
    [ 0.0053,  0.0256]], device='cuda:1')
    0%|   
    

    Using FP16 also fails.

    Expected behavior

    The script should run to completion.

  • Rename second input dimension for ONNX-supported CV models


    What does this PR do?

    The second input dimension of pixel_values for CV models with ONNX support is currently named "sequence". This PR renames it to "num_channels".

    Also:

    • MobileViT is added to the list of models to test in tests/onnx/test_onnx_v2.py
    • for DETR, the second input dimension is removed for pixel_masks because there is actually no channel dimension (batch x height x width)
    • for MobileViT, a second input dimension is added to pixel_values so that it follows the same pattern as other vision models

    Before submitting

    • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
    • [x] Did you read the contributor guideline, Pull Request section?
    • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
    • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
    • [x] Did you write any new necessary tests?

    Who can review?

    Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

  • Grid search ProgressCallback leads to encoding issue on Windows


    System Info

    • transformers version: 4.20.0
    • Platform: Windows-10-10.0.19041-SP0
    • Python version: 3.8.8
    • Huggingface_hub version: 0.7.0
    • PyTorch version (GPU?): 1.12.0+cu116 (True)

    Who can help?

    @richardliaw @amogkam

    Information

    • [ ] The official example scripts
    • [X] My own modified scripts

    Tasks

    • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • [ ] My own task or dataset (give details below)

    Reproduction

    When I use ray[tune] grid search on Windows, the script crashes due to encoding issues. The problem seems to originate in tqdm - so when the progress bar is reaching a certain point (right after 1%) the script will crash. The reason is that to plot the progress, TQDM makes use of different characters - some small/large blocks. But apparently not all of these work well on Windows.

    (pid=9060, ip=127.0.0.1, repr=_objective)
      File "python\ray\_raylet.pyx", line 665, in ray._raylet.execute_task
      File "python\ray\_raylet.pyx", line 669, in ray._raylet.execute_task
      File "python\ray\_raylet.pyx", line 616, in ray._raylet.execute_task.function_executor
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\_private\function_manager.py", line 675, in actor_method_executor
        return method(__ray_actor, *args, **kwargs)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\util\tracing\tracing_helper.py", line 462, in _resume_span
        return method(self, *_args, **_kwargs)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\tune\trainable.py", line 360, in train
        result = self.step()
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\util\tracing\tracing_helper.py", line 462, in _resume_span
        return method(self, *_args, **_kwargs)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\tune\function_runner.py", line 404, in step
        self._report_thread_runner_error(block=True)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\util\tracing\tracing_helper.py", line 462, in _resume_span
        return method(self, *_args, **_kwargs)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\tune\function_runner.py", line 574, in _report_thread_runner_error
        raise e
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\tune\function_runner.py", line 277, in run
        self._entrypoint()
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\tune\function_runner.py", line 349, in entrypoint
        return self._trainable_func(
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\util\tracing\tracing_helper.py", line 462, in _resume_span
        return method(self, *_args, **_kwargs)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\tune\function_runner.py", line 645, in _trainable_func
        output = fn()
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\transformers\integrations.py", line 288, in dynamic_modules_import_trainable
        return trainable(*args, **kwargs)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\tune\utils\trainable.py", line 410, in inner
        trainable(config, **fn_kwargs)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\transformers\integrations.py", line 189, in _objective
        local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\transformers\trainer.py", line 1409, in train
        return inner_training_loop(
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\transformers\trainer.py", line 1726, in _inner_training_loop
        self.control = self.callback_handler.on_step_end(args, self.state, self.control)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\transformers\trainer_callback.py", line 369, in on_step_end
        return self.call_event("on_step_end", args, state, control)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\transformers\trainer_callback.py", line 388, in call_event
        result = getattr(callback, event)(
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\transformers\trainer_callback.py", line 472, in on_step_end
        self.training_bar.update(state.global_step - self.current_step)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\tqdm\std.py", line 1256, in update
        self.refresh(lock_args=self.lock_args)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\tqdm\std.py", line 1361, in refresh
        self.display()
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\tqdm\std.py", line 1509, in display
        self.sp(self.__str__() if msg is None else msg)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\tqdm\std.py", line 350, in print_status
        fp_write('\r' + s + (' ' * max(last_len[0] - len_s, 0)))
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\tqdm\std.py", line 343, in fp_write
        fp.write(_unicode(s))
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\tqdm\utils.py", line 145, in inner
        return func(*args, **kwargs)
      File "C:\Users\bramv\.virtualenvs\transformers-finetuner-ah-81wJc\lib\site-packages\ray\tune\utils\util.py", line 228, in write
        self.stream2.write(*args, **kwargs)
      File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u258f' in position 6: character maps to <undefined>
    

    The reason that I post this in transformers is that I have never had any issues with the progress bars in the rest of the library, but something odd is going on with the ones that are present during grid search. I don't know the integration code base well enough to find what might be causing this. Interestingly, in the rest of the library, a tqdm progress bar always takes up the whole screen and no issues happen (because for progress, tqdm can use large block characters). But for grid search, the progress bars seem to have a fixed, small width. That's why it requires different characters to plot the progress (smaller increments/blocks -> different characters). So if we can change the tqdm that is being used during grid search to be the same as the other ones in the library, then there should not be any issues, I believe.

    Expected behavior

    No encoding issues, like in the rest of the library. This issue only occurs with grid search.

  • openai's CLIP model not working with pytorch 1.12 in some environments


    System Info

    • transformers version: 4.20.1
    • Platform: Linux-5.4.170+-x86_64-with-glibc2.31
    • Python version: 3.9.12
    • Huggingface_hub version: 0.8.1
    • PyTorch version (GPU?): 1.11.0+cu113 (True)
    • Tensorflow version (GPU?): 2.7.0 (False)
    • Flax version (CPU?/GPU?/TPU?): not installed (NA)
    • Jax version: not installed
    • JaxLib version: not installed
    • Using GPU in script?: yes
    • Using distributed or parallel set-up in script?: no

    Who can help?

    @patil-suraj

    Information

    • [ ] The official example scripts
    • [X] My own modified scripts

    Tasks

    • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • [X] My own task or dataset (give details below)

    Reproduction

    The following works as expected with torch 1.11, but generates the below error in version 1.12:

    import io
    import requests
    
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor
    
    
    def load_image(bytes, max_width=100, max_height=100, force_rgb=True):
        """Create and optionally resize an image from bytes."""
        img = Image.open(io.BytesIO(bytes))
    
        width, height = img.size
        if width > max_width or height > max_height:
            img.thumbnail(size=(max_width, max_height))
    
        if img.mode != "RGB" and force_rgb:
            img = img.convert("RGB")
    
        return img
    
    urls = [
        "https://placekitten.com/408/287",
        "https://placekitten.com/200/138"
    ]
    
    images = [load_image(requests.get(url).content) for url in urls]
    
    name = "openai/clip-vit-base-patch32"
    proc = CLIPProcessor.from_pretrained(name)
    model = CLIPModel.from_pretrained(name)
    model.to(torch.device("cuda"))
    
    inputs = proc(images=images, return_tensors="pt").to(torch.device("cuda"))
    embeddings = model.get_image_features(**inputs).detach().cpu().numpy()
    

    This results in:

    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    Input In [1], in <cell line: 41>()
         38 model.to(torch.device("cuda"))
         40 inputs = proc(images=images, return_tensors="pt").to(torch.device("cuda"))
    ---> 41 embeddings = model.get_image_features(**inputs).detach().cpu().numpy()
    
    RuntimeError: CUDA error: an illegal memory access was encountered
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    

    Here is how I test the different versions, keeping all else the same:

    !pip uninstall -y torch torchvision torchaudio
    !pip install --no-cache-dir torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
    # !pip install --no-cache-dir torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
    

    And here is some more info about the hardware environment:

    
    System Report

    Linux
    Linux-5.4.170+-x86_64-with-glibc2.31

    CPUs
      logical: 2, physical: 1, usable: [0, 1]

    RAM (GB)
      total: 7.3, available: 5.6, used: 1.5, free: 3.1, active: 2.7,
      inactive: 1.1, buffers: 0.4, cached: 2.4, shared: 0.0, slab: 0.3

    Disk /home (GB)
      total: 48.9, used: 2.3, free: 46.6

    GPU
      name: Tesla K80
      driver_version: 450.119.04
      vbios_version: 80.21.25.00.04
      memory.total: 11441 MiB
      memory.free: 11438 MiB
      memory.used: 3 MiB

    Packages
      numpy: 1.22.0
      torch: 1.12.0+cu113
      transformers: 4.20.1
    

    Expected behavior

    The code should run without CUDA errors.

  • XLA train step fixes


    This PR makes a bunch of changes to the TF codebase to improve XLA support, in preparation for our upcoming big TF release. The goal is to allow users to use jit_compile on the vast majority of our models, which should yield large performance improvements for TF. In particular:

    • Rewrites to the train_step and test_step so that any mutable Python input dicts are not modified in the step. This was a bad idea anyway, but it causes particular problems with XLA, which is very functional and hates side effects, like JAX.
    • Rewrites to the common hf_compute_loss functions to ensure that static shapes are maintained throughout, so that XLA compilation is possible.
    • Add a test to ensure that we can still fit models when XLA compilation is used. XLA compilation is quite expensive, which makes this test quite slow, so it's restricted to core models for now and tagged as @slow.

    Left to do:

    • [ ] Fix XLA-incompatible model-specific hf_compute_loss functions. On a quick search it looked like there were 4-5 of these, so it shouldn't take too long. Any use of tf.boolean_mask is a surefire sign that XLA compilation will break, because output shapes become data-dependent (see the sketch after this list).
    • [ ] See if there's a way to test non-core models for XLA fit support without crippling performance.
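    As a generic illustration of the tf.boolean_mask point above (a sketch, not the actual hf_compute_loss code): with a boolean mask the output length depends on the data, while masking arithmetically keeps every shape static, which is what XLA needs.

    import tensorflow as tf

    losses = tf.constant([0.3, 0.7, 0.1, 0.9])
    mask = tf.constant([1.0, 0.0, 1.0, 1.0])  # 1.0 = real token, 0.0 = padding

    # Data-dependent shape: the result length depends on how many mask entries are True,
    # which is what breaks XLA compilation.
    mean_dynamic = tf.reduce_mean(tf.boolean_mask(losses, tf.cast(mask, tf.bool)))

    # Static-shape alternative: zero out the masked losses and normalise by the mask sum.
    mean_static = tf.reduce_sum(losses * mask) / tf.reduce_sum(mask)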
  • Skip a particular exception in `test_sample_generate`


    What does this PR do?

    A continuation of #17937 to fix a CI failure

                # sample
                probs = nn.functional.softmax(next_token_scores, dim=-1)
    >           next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
    E           RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
    

    As @patrickvonplaten mentioned, when a broken generation happens due to all -inf scores along the vocab dimension, there is nothing we can do. This is likely to happen only with random models, however.

    Let's say goodbye to this flaky situation!

  • TrainingArguments does not support `mps` device (Mac M1 GPU)


    System Info

    • transformers version: 4.21.0.dev0
    • Platform: macOS-12.4-arm64-arm-64bit
    • Python version: 3.8.9
    • Huggingface_hub version: 0.8.1
    • PyTorch version (GPU?): 1.12.0 (False)
    • Tensorflow version (GPU?): not installed (NA)
    • Flax version (CPU?/GPU?/TPU?): not installed (NA)
    • Jax version: not installed
    • JaxLib version: not installed
    • Using GPU in script?: yes
    • Using distributed or parallel set-up in script?: no

    Who can help?

    @sgugger

    Information

    • [X] The official example scripts
    • [ ] My own modified scripts

    Tasks

    • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • [ ] My own task or dataset (give details below)

    Reproduction

    export TASK_NAME=wnli
    python run_glue.py \
      --model_name_or_path bert-base-cased \
      --task_name $TASK_NAME \
      --do_train \
      --do_eval \
      --max_seq_length 128 \
      --per_device_train_batch_size 32 \
      --learning_rate 2e-5 \
      --num_train_epochs 3 \
      --output_dir /tmp/$TASK_NAME/
    

    Expected behavior

    When running the Trainer.train on a machine with an MPS GPU, it still just uses the CPU. I expected it to use the MPS GPU. This is supported by torch in the newest version 1.12.0, and we can check if the MPS GPU is available using torch.backends.mps.is_available().

    It seems like the issue lies in the TrainingArguments._setup_devices method, which doesn't appear to allow for the case where device = "mps".
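    For context, here is a minimal availability check of the kind mentioned above (assuming torch>=1.12; this is just an illustration, not the current TrainingArguments logic):

    import torch

    if torch.backends.mps.is_available():
        device = torch.device("mps")   # Apple Silicon GPU
    elif torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    print(device)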
