A very simple framework for state-of-the-art Natural Language Processing (NLP)



A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends.


Flair is:

  • A powerful NLP library. Flair allows you to apply our state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), special support for biomedical data, sense disambiguation and classification, with support for a rapidly growing number of languages.

  • A text embedding library. Flair has simple interfaces that allow you to use and combine different word and document embeddings, including our proposed Flair embeddings, BERT embeddings and ELMo embeddings (see the short sketch after this list).

  • A PyTorch NLP framework. Our framework builds directly on PyTorch, making it easy to train your own models and experiment with new approaches using Flair embeddings and classes.
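
As a quick illustration of the embedding interface, here is a minimal sketch of stacking embeddings with StackedEmbeddings; the embedding names ('glove', 'news-forward', 'news-backward') are just example choices and are downloaded on first use:

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# combine classic word embeddings with contextual Flair embeddings
stacked_embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])

# embed an example sentence; every token now carries the concatenated vector
sentence = Sentence('I love Berlin .')
stacked_embeddings.embed(sentence)

for token in sentence:
    print(token, token.embedding.shape)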

Now at version 0.8!

State-of-the-Art Models

Flair ships with state-of-the-art models for a range of NLP tasks. For instance, check out our latest NER models:

Language | Dataset | Flair | Best published | Model card & demo
---------|---------|-------|----------------|------------------
English | Conll-03 (4-class) | 94.09 | 94.3 (Yamada et al., 2018) | Flair English 4-class NER demo
English | Ontonotes (18-class) | 90.93 | 91.3 (Yu et al., 2016) | Flair English 18-class NER demo
German | Conll-03 (4-class) | 92.31 | 90.3 (Yu et al., 2016) | Flair German 4-class NER demo
Dutch | Conll-03 (4-class) | 95.25 | 93.7 (Yu et al., 2016) | Flair Dutch 4-class NER demo
Spanish | Conll-03 (4-class) | 90.54 | 90.3 (Yu et al., 2016) | Flair Spanish 4-class NER demo

New: Most Flair sequence tagging models (named entity recognition, part-of-speech tagging etc.) are now hosted on the 🤗 HuggingFace model hub! You can browse models, check detailed information on how they were trained, and even try each model out online!

Quick Start

Requirements and Installation

The project is based on PyTorch 1.5+ and Python 3.6+, because method signatures and type hints are beautiful. If you do not have Python 3.6, install it first. Here is how for Ubuntu 16.04. Then, in your favorite virtual environment, simply do:

pip install flair

Example Usage

Let's run named entity recognition (NER) over an example sentence. All you need to do is make a Sentence, load a pre-trained model and use it to predict tags for the sentence:

from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence('I love Berlin .')

# load the NER tagger
tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)

Done! The Sentence now has entity annotations. Print the sentence to see what the tagger found.

print(sentence)
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

This should print:

Sentence: "I love Berlin ." - 4 Tokens

The following NER tags are found:

Span [3]: "Berlin"   [− Labels: LOC (0.9992)]

Tutorials

We provide a set of quick tutorials to get you started with the library.

The tutorials explain how the base NLP classes work, how you can load pre-trained models to tag your text, how you can embed your text with different word or document embeddings, and how you can train your own language models, sequence labeling models, and text classification models. Let us know if anything is unclear.
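
To give a flavor of what training looks like, here is a minimal sketch of training an NER tagger with the ModelTrainer; it assumes the CoNLL-03 data has been obtained separately, and the hyperparameters and embedding choice are only examples (argument names can differ slightly between Flair versions):

from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# load a corpus (CONLL_03 expects the dataset files to be available locally)
corpus = CONLL_03()
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

# a small tagger over classic word embeddings
tagger = SequenceTagger(hidden_size=256,
                        embeddings=WordEmbeddings('glove'),
                        tag_dictionary=tag_dictionary,
                        tag_type='ner')

# train and write checkpoints and logs to the given folder
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/example-ner', max_epochs=10)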

There is also a dedicated landing page for our biomedical NER and datasets with installation instructions and tutorials.

There are also good third-party articles and posts that illustrate how to use Flair.

Citing Flair

Please cite the following paper when using Flair embeddings:

@inproceedings{akbik2018coling,
  title={Contextual String Embeddings for Sequence Labeling},
  author={Akbik, Alan and Blythe, Duncan and Vollgraf, Roland},
  booktitle = {{COLING} 2018, 27th International Conference on Computational Linguistics},
  pages     = {1638--1649},
  year      = {2018}
}

If you use the Flair framework for your experiments, please cite this paper:

@inproceedings{akbik2019flair,
  title={FLAIR: An easy-to-use framework for state-of-the-art NLP},
  author={Akbik, Alan and Bergmann, Tanja and Blythe, Duncan and Rasul, Kashif and Schweter, Stefan and Vollgraf, Roland},
  booktitle={{NAACL} 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)},
  pages={54--59},
  year={2019}
}

If you use the pooled version of the Flair embeddings (PooledFlairEmbeddings), please cite this paper:

@inproceedings{akbik2019naacl,
  title={Pooled Contextualized Embeddings for Named Entity Recognition},
  author={Akbik, Alan and Bergmann, Tanja and Vollgraf, Roland},
  booktitle = {{NAACL} 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  pages     = {724--728},
  year      = {2019}
}

If you use our new "FLERT" models or approach, please cite this paper:

@misc{schweter2020flert,
    title={FLERT: Document-Level Features for Named Entity Recognition},
    author={Stefan Schweter and Alan Akbik},
    year={2020},
    eprint={2011.06993},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Contact

Please email your questions or comments to Alan Akbik.

Contributing

Thanks for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.

For contributors looking to get deeper into the API we suggest cloning the repository and checking out the unit tests for examples of how to call methods. Nearly all classes and methods are documented, so finding your way around the code should hopefully be easy.

Running unit tests locally

You need Pipenv for this:

pipenv install --dev && pipenv shell
pytest tests/

To run integration tests execute:

pytest --runintegration tests/

The integration tests will train small models. Afterwards, the trained model will be loaded for prediction.

To also run slow tests, such as loading and using the embeddings provided by flair, you should execute:

pytest --runslow tests/

License

The MIT License (MIT)

Flair is licensed under the following MIT license: Copyright © 2018 Zalando SE, https://tech.zalando.com

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Comments
  • Flair 0.5 features


    Here, I'd like to collect some ideas for features that we would like to see in the next version of Flair.

    Ideas:

    • [x] Refactor data loading methods. We currently load the entire training data set into memory, but this is a problem for large datasets (#458 #457) and may also cause bottlenecks in GPU usage. Idea is to use the DataLoader abstraction (as currently used in the LanguageModelTrainer) for asynchronous loading from disk. This should make training over large datasets possible and may also significantly improve training speed.
    • [x] Refactor flair.nn.Model and ModelTrainer. The ModelTrainer currently supports training SequenceLabeler and TextClassification classes, but community members have suggested other tasks, such as regression (#440) or seq2seq (#560). The flair.nn.Model interface needs to be simplified (fewer methods) and generalized in such a way that implementing this interface will immediately enable training using the ModelTrainer class (see also #474).
    • [ ] Multi-Task Learning: This one has been on our list for a while, but we'd like to add simple methods for training multiple tasks at the same time. To do this, we may need to refactor the embeddings classes to make it easier to expose internal states (see #524).
    • [x] Tokenization. Right now, we use segtok for tokenization, but maybe we can include other tokenizers (#394), perhaps even our own trained over the UD corpora.
    • [ ] Multi-GPU support: With the changes to the new CUDA semantics introduced in 0.4.1 we can now look into multi-GPU support.

    Any other ideas? Please let us know!

    A side note: In March, a few of us will be out of office for vacations, so development will likely slow down a bit. But come April, we'll start working full steam on 0.5 :)

  • Comparison between BERT, ELMo, and Flair embeddings


    We want to collect experiments here that compare BERT, ELMo, and Flair embeddings. So if you have any findings on which embedding types work best on which kinds of tasks, we would be more than happy if you share your results. We are also going to run some experiments and share our results here.

  • pytorch-pretrained-bert to pytorch-transformers upgrade


    Hi,

    the upcoming 1.0 version of pytorch-pretrained-bert will introduce several API changes, new models and even a name change to pytorch-transformers.

    After the final 1.0 release, flair could support 7 different Transformer-based architectures:

    • [x] BERT -> BertEmbeddings
    • [x] OpenAI GPT -> OpenAIGPTEmbeddings
    • [x] OpenAI GPT-2 -> OpenAIGPT2Embeddings 🛡️
    • [x] Transformer-XL -> TransformerXLEmbeddings
    • [x] XLNet -> XLNetEmbeddings 🛡️
    • [x] XLM -> XLMEmbeddings 🛡️
    • [x] RoBERTa -> RoBERTaEmbeddings 🛡️ (currently not covered by pytorch-transformers)

    🛡️ indicates a new embedding class for flair.

    It also introduces a universal API for all models, so quite a few changes in flair are necessary to support both old and new embedding classes.

    This issue tracks the implementation status for all of the embedding classes listed above 😊
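
    For reference, the existing wrapper classes are used like this (a minimal sketch with an example model name; the new classes listed above would follow the same pattern):

    from flair.data import Sentence
    from flair.embeddings import BertEmbeddings

    # wrap a pretrained transformer model and embed a sentence with it
    bert_embedding = BertEmbeddings('bert-base-multilingual-cased')
    sentence = Sentence('I love Berlin .')
    bert_embedding.embed(sentence)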

  • Spanish LM


    Hello, I just trained a Spanish LM and I wonder whether it is good enough. How do you test whether an LM is good enough? For example, what loss do you get for the English model? And what does ppl stand for?

    This is what I got for the very last split.

    Split 10 - (08:27:57) (08:29:14)
    | split 10 / 9 | 100/ 555 batches | ms/batch 11655.58 | loss 1.37 | ppl 3.95
    | split 10 / 9 | 200/ 555 batches | ms/batch 11570.46 | loss 1.36 | ppl 3.89
    | split 10 / 9 | 300/ 555 batches | ms/batch 11550.08 | loss 1.35 | ppl 3.88
    | split 10 / 9 | 400/ 555 batches | ms/batch 11563.46 | loss 1.35 | ppl 3.86
    | split 10 / 9 | 500/ 555 batches | ms/batch 11523.42 | loss 1.35 | ppl 3.86
    training done! (10:16:09)
    best loss so far 1.26

    | end of split 1 / 9 | epoch 0 | time: 7542.77s | valid loss 1.26 | valid ppl 3.52 | learning rate 20.00
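
    As a side note on the question above: ppl in these logs is the perplexity, which is simply exp(loss), so the numbers line up with the losses shown:

    import math

    # perplexity = exp(loss); small differences come from the loss being rounded in the log
    print(math.exp(1.37))  # ~3.94, logged as ppl 3.95
    print(math.exp(1.26))  # ~3.53, logged as valid ppl 3.52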

  • Multilingual NER


    When I use multilingual embeddings for NER, until now I have used input text only in English, and the model gave me good inference results in Spanish. However, I now also have a small number of tagged samples in Spanish (fewer than 1000 sentences), which is too few to build a standalone Spanish NER model. My question is therefore whether I can combine both English and Spanish samples for training. Is this possible, and do you have any thoughts on the accuracy of this kind of mixed-language training data?

    Remark: The entities which I use are not the standard ones, I have custom entities to train the NER model.

    Thanks in advance, Igor
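
    One possible way to combine the two languages (a sketch, not an official recipe; the folder layout and column format are assumptions) is to load each language as its own corpus and concatenate them with Flair's MultiCorpus:

    from flair.data import MultiCorpus
    from flair.datasets import ColumnCorpus

    # hypothetical CoNLL-style folders, one per language, with the custom NER tags in column 1
    columns = {0: 'text', 1: 'ner'}
    english_corpus = ColumnCorpus('data/en', columns, train_file='train.txt')
    spanish_corpus = ColumnCorpus('data/es', columns, train_file='train.txt')

    # a single corpus that concatenates both; a tagger trained on it sees both languages
    corpus = MultiCorpus([english_corpus, spanish_corpus])

    Multilingual embeddings (for example the 'multi-forward'/'multi-backward' Flair embeddings or multilingual BERT) are typically a good fit here, since both languages then share one representation space.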

  • Error when creating embeddings - HEAD request to S3 bucket returns 404


    Hello, I have a problem loading Word/FlairEmbeddings for English and German languages located at the urls: "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/xxxxxxxxx.pt".

    When following Tutorial 3, trying to create these embeddings gives the following error:

    OSError: HEAD request failed for url https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim.vectors.npy with status code 404

    Making a simple HEAD request to that url outside of flair also returns 404, so it looks like the embeddings are not located there anymore?

    To Reproduce

    from flair.embeddings import WordEmbeddings

    glove_embedding = WordEmbeddings('glove')

    I would appreciate your help, Thank you !

  • Arabic LM


    Hello, I tried to generate a language model for Arabic using Flair, but it does not seem to work as expected. I used the Leipzig Corpora Collection as my training corpus; it contains 1M sentences in Arabic. Here is the code used for the training:

    language_model = LanguageModel(dictionary, is_forward_lm, hidden_size=512, nlayers=1)

    # train your language model
    trainer = LanguageModelTrainer(language_model, corpus)

    trainer.train('resources/taggers/language_model',
                  sequence_length=250,
                  mini_batch_size=100,
                  max_epochs=10)

    Once finished, when I try to generate text via the script provided with Flair, I get the output shown in the attached "text generation arabic" screenshot.

    Should I preprocess my dataset before training, or is it just an underfitting issue? Please advise.

  • GH-1021: Compute Flair embeddings over original string and without spaces after token


    Currently, a space is inserted after every token regardless of the whitespace_after value, so the sentences being embedded do not look like the normal text on which the Flair language models (embeddings) were trained. Also, the hidden state is taken as the embedding only after consuming a space.

    This PR computes embeddings over text with the original whitespace. I am not sure whether this approach is better in downstream tasks, but it is consistent with the trained Flair language model.

    Initial experiments show that calculating the token embedding without the trailing space gives better results; using the whitespace_after information is slightly worse. First epoch of training with fixed seeds:

    2020-01-18 21:54:41,014 epoch 1 - iter 43/431 - loss 12.58697907 - samples/sec: 104.66
    2020-01-18 21:54:54,871 epoch 1 - iter 86/431 - loss 8.39764844 - samples/sec: 99.44
    2020-01-18 21:55:08,917 epoch 1 - iter 129/431 - loss 6.48929740 - samples/sec: 98.12
    2020-01-18 21:55:23,167 epoch 1 - iter 172/431 - loss 5.37756264 - samples/sec: 96.67
    2020-01-18 21:55:36,985 epoch 1 - iter 215/431 - loss 4.63144744 - samples/sec: 99.72
    2020-01-18 21:55:50,898 epoch 1 - iter 258/431 - loss 4.14733414 - samples/sec: 99.01
    2020-01-18 21:56:04,915 epoch 1 - iter 301/431 - loss 3.76650934 - samples/sec: 98.27
    2020-01-18 21:56:18,543 epoch 1 - iter 344/431 - loss 3.50686799 - samples/sec: 101.09
    2020-01-18 21:56:32,710 epoch 1 - iter 387/431 - loss 3.27343043 - samples/sec: 97.24
    2020-01-18 21:56:46,579 epoch 1 - iter 430/431 - loss 3.07420425 - samples/sec: 99.32
    2020-01-18 21:56:46,824 ----------------------------------------------------------------------------------------------------
    2020-01-18 21:56:46,824 EPOCH 1 done: loss 3.0724 - lr 0.1000
    2020-01-18 21:57:00,554 DEV : loss 0.501313328742981 - score 0.9818
    

    vs. with space after every token:

    2020-01-18 21:50:48,533 epoch 1 - iter 43/431 - loss 12.42667790 - samples/sec: 103.46
    2020-01-18 21:51:02,377 epoch 1 - iter 86/431 - loss 8.29389394 - samples/sec: 99.53
    2020-01-18 21:51:16,640 epoch 1 - iter 129/431 - loss 6.39272680 - samples/sec: 96.59
    2020-01-18 21:51:31,172 epoch 1 - iter 172/431 - loss 5.28999498 - samples/sec: 94.78
    2020-01-18 21:51:45,272 epoch 1 - iter 215/431 - loss 4.54830917 - samples/sec: 97.70
    2020-01-18 21:51:59,159 epoch 1 - iter 258/431 - loss 4.06829357 - samples/sec: 99.19
    2020-01-18 21:52:13,226 epoch 1 - iter 301/431 - loss 3.69254654 - samples/sec: 97.92
    2020-01-18 21:52:26,832 epoch 1 - iter 344/431 - loss 3.43669727 - samples/sec: 101.30
    2020-01-18 21:52:41,262 epoch 1 - iter 387/431 - loss 3.21065094 - samples/sec: 95.45
    2020-01-18 21:52:55,442 epoch 1 - iter 430/431 - loss 3.01424922 - samples/sec: 97.19
    2020-01-18 21:52:55,693 ----------------------------------------------------------------------------------------------------
    2020-01-18 21:52:55,693 EPOCH 1 done: loss 3.0125 - lr 0.1000
    2020-01-18 21:53:09,711 DEV : loss 0.46862271428108215 - score 0.9831
    

    vs. original code (embedding includes space after):

    2020-01-18 19:11:18,763 epoch 1 - iter 43/431 - loss 13.77513606 - samples/sec: 100.09
    2020-01-18 19:11:32,758 epoch 1 - iter 86/431 - loss 9.86612977 - samples/sec: 98.49
    2020-01-18 19:11:46,444 epoch 1 - iter 129/431 - loss 7.84115263 - samples/sec: 100.66
    2020-01-18 19:12:00,549 epoch 1 - iter 172/431 - loss 6.57365022 - samples/sec: 97.65
    2020-01-18 19:12:14,326 epoch 1 - iter 215/431 - loss 5.71325676 - samples/sec: 99.98
    2020-01-18 19:12:28,293 epoch 1 - iter 258/431 - loss 5.13455269 - samples/sec: 98.62
    2020-01-18 19:12:42,390 epoch 1 - iter 301/431 - loss 4.66878698 - samples/sec: 97.71
    2020-01-18 19:12:55,841 epoch 1 - iter 344/431 - loss 4.35207718 - samples/sec: 102.41
    2020-01-18 19:13:09,813 epoch 1 - iter 387/431 - loss 4.08306071 - samples/sec: 98.59
    2020-01-18 19:13:23,596 epoch 1 - iter 430/431 - loss 3.83598182 - samples/sec: 99.94
    2020-01-18 19:13:23,847 ----------------------------------------------------------------------------------------------------
    2020-01-18 19:13:23,847 EPOCH 1 done: loss 3.8318 - lr 0.1000
    2020-01-18 19:13:37,732 DEV : loss 0.9926474690437317 - score 0.9577
    

    You can test it on example train.py with POS tagging on UD after applying https://github.com/flairNLP/flair/pull/1361.
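
    For readers following along, a simplified sketch of the idea (not the PR's actual code): rebuild the string the language model sees by honoring each token's whitespace_after flag instead of always appending a space.

    def original_text(sentence):
        # concatenate tokens, adding a space only where the original text had one
        text = ''
        for token in sentence:
            text += token.text
            if token.whitespace_after:
                text += ' '
        return text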

  • Unable to load embeddings


    from flair.embeddings import FlairEmbeddings, BertEmbeddings

    # init Flair embeddings
    flair_forward_embedding = FlairEmbeddings('multi-forward')
    flair_backward_embedding = FlairEmbeddings('multi-backward')

    # init multilingual BERT
    bert_embedding = BertEmbeddings('bert-base-multilingual-cased')


    AttributeError                            Traceback (most recent call last)
    <ipython-input> in <module>
          2
          3 # init Flair embeddings
    ----> 4 flair_forward_embedding = FlairEmbeddings('multi-forward')
          5 flair_backward_embedding = FlairEmbeddings('multi-backward')
          6

    C:\Anaconda3\lib\site-packages\flair\embeddings.py in __init__(self, model, detach, use_cache, cache_directory)
        562         self.static_embeddings = detach
        563
    --> 564         from flair.models import LanguageModel
        565         self.lm = LanguageModel.load_language_model(model)
        566         self.detach = detach

    C:\Anaconda3\lib\site-packages\flair\__init__.py in <module>
          1 from . import data
    ----> 2 from . import models
          3 from . import visual
          4 from . import trainers
          5

    C:\Anaconda3\lib\site-packages\flair\models\__init__.py in <module>
    ----> 1 from .sequence_tagger_model import SequenceTagger
          2 from .language_model import LanguageModel
          3 from .text_classification_model import TextClassifier

    C:\Anaconda3\lib\site-packages\flair\models\sequence_tagger_model.py in <module>
         64
         65
    ---> 66 class SequenceTagger(flair.nn.Model):
         67
         68     def __init__(self,

    AttributeError: module 'flair' has no attribute 'nn'

  • transformer models for language model training and tag prediction instead of LSTM's


    I recently read the generative pre-training paper from OpenAI. According to the benchmarks, fine-tuning the OpenAI model on a custom dataset takes much less time than an LSTM-based approach, and the model has improved the state of the art on a lot of tasks. So I was wondering whether it is possible to replace the pipeline with a transformer-based model as implemented by OpenAI.

  • Support for more languages?


    Hi! Flair looks amazing. Clean code, easy to use. Thanks for making it open source!

    I was wondering if you plan to add support for more languages? Maybe all the languages where Zalando operates? :) I'm working for a company that needs NLP code that works across pretty much the same set of countries.

    Looking at the different available libraries, pre-trained models for more than just English (and German, in this case!) are lacking everywhere else.

  • Load own corpus and train task specific embedding


    Hi all,

    I have a corpus dataset which contains text in the first column and a certainty label in the second column. I would like to build a model that predicts the certainty labels. The data is already split into train, dev and test, and all files are located in the corpus folder as required. I have the files as csv and txt files; for both, the separator is a tab. I tried the following, which worked in version 0.10:

    from flair.data import Corpus
    from flair.datasets import CSVClassificationCorpus

    # this is the folder in which train, test and dev files reside
    data_folder = '/path/to/corpus'

    column_name_map = {0: "Text", 1: "certainty_Label"}
    label_type = "certainty_label"

    corpus: Corpus = CSVClassificationCorpus(data_folder,
                                             column_name_map,
                                             skip_header=True,
                                             delimiter="\t",
                                             label_type=label_type)

    This however results in the following error:

    [...]
    -> 1209     raise RuntimeError("No data provided when initializing corpus object.")
       1211     # sample test data from train if none is provided
       1212     if test is None and sample_missing_splits and train and not sample_missing_splits == "only_dev":

    RuntimeError: No data provided when initializing corpus object.

    I would like to train a sentence classification model that predicts the author's certainty for a written sentence, so I want to train a model with embeddings specific to this use case. Any recommendations on fixing my error, as well as alternative approaches to load my corpus, embed it and train the model, are appreciated. Many thanks :)

    P.S. A small example of the training dataset is attached: example_trainingset.csv
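
    (For context, the downstream step being asked about would look roughly like this once the corpus loads; a sketch only, since argument names vary slightly between Flair versions:)

    from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
    from flair.models import TextClassifier
    from flair.trainers import ModelTrainer

    # document embeddings built by pooling word embeddings (any embedding combination works here)
    document_embeddings = DocumentPoolEmbeddings([WordEmbeddings('glove')])

    # label dictionary for the declared label type, taken from the loaded corpus
    label_dict = corpus.make_label_dictionary(label_type=label_type)

    # classifier and trainer
    classifier = TextClassifier(document_embeddings, label_type=label_type, label_dictionary=label_dict)
    trainer = ModelTrainer(classifier, corpus)
    trainer.train('resources/classifiers/certainty', max_epochs=10)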

  • Error found


    2022-12-22 18:13:03,186 epoch 1 - iter 143/1434 - loss 0.91632178 - samples/sec: 16.81 - lr: 0.100000
    Traceback (most recent call last):
      File "trainer.py", line 48, in
        max_epochs=150)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/flair/trainers/trainer.py", line 500, in train
        loss = self.model.forward_loss(batch_step)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/flair/models/sequence_tagger_model.py", line 270, in forward_loss
        scores, gold_labels = self.forward(sentences)  # type: ignore
      File "/home/ubuntu/.local/lib/python3.6/site-packages/flair/models/sequence_tagger_model.py", line 290, in forward
        sentence_tensor = sentence_tensor[length_indices]
    RuntimeError: CUDA error: unspecified launch failure
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

  • Support for Vietnamese


        Hi, I am looking through Flair and wondering whether it supports Vietnamese or not. If not, will it in the future? Thank you!
    

    Originally posted by @longsc2603 in https://github.com/flairNLP/flair/issues/2#issuecomment-1354413764

  • Cannot export Onnx-Embeddings


    Describe the bug: At first there is an issue with the parameters not being correct; this can easily be solved by converting the list to a tuple at https://github.com/flairNLP/flair/blob/master/flair/embeddings/transformer.py#L780 : embedding, (example_tensors,), ... However, I then get an exception:

    RuntimeError: r INTERNAL ASSERT FAILED at "C:\\actions-runner\\_work\\pytorch\\pytorch\\builder\\windows\\pytorch\\aten\\src\\ATen/core/jit_type_base.h":545, please report a bug to PyTorch
    

    That error is raised at the line checking input.type().scalarType() is not None, where input.type() is of type List[Tensor]. I suppose this is due to an input variable of a type that is currently not supported, and it will require some serious debugging.

    To Reproduce: Follow the ONNX tutorial at https://github.com/flairNLP/flair/commit/117e14e5d821a1a4a655ccbbe495fc380ecebb7a (current master).

    Expected behavior: ONNX export works as expected and does not throw any exception.

    Environment (please complete the following information):
    • OS: Windows
    • flair at https://github.com/flairNLP/flair/commit/117e14e5d821a1a4a655ccbbe495fc380ecebb7a
    • transformers 4.20.1
    • torch 1.12.0+cu116
    • onnxruntime 1.11.1
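
    As background on the first workaround (generic PyTorch behavior, not flair-specific code): torch.onnx.export expects the example inputs as a tuple, which is why converting the list of example tensors to a tuple gets past the initial parameter error. A minimal sketch:

    import torch
    from torch import nn

    # toy model and example input; the `args` argument must be a tuple (or a single tensor), not a list
    model = nn.Linear(4, 2)
    example_input = torch.randn(1, 4)
    torch.onnx.export(model, (example_input,), "model.onnx")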

  • Corpus incorrectly aligning spans from Flair 0.11


    Hello,

    First of all, thanks for making this library publicly available!

    Description of the bug: When using Flair 0.11 and higher, I noticed that in some documents my spans were not aligned with how they are represented in the data. Downgrading to Flair 0.10 or lower seemed to fix the issue.

    How to Reproduce: I created an example txt file where the issue is prevalent. Using the following example and code snippet should allow you to reproduce the issue.

    Create some fake ner data:

    example_txt = "George B-NAME\n"
    example_txt += "Washington I-NAME\n"
    example_txt += "went O\n"
    example_txt += "\t O\n"
    example_txt += "Washington B-CITY\n"
    example_txt += "and O\n"
    example_txt += "enjoyed O\n"
    example_txt += "some O\n"
    example_txt += "coffee B-BEVERAGE\n"
    with open("notebooks/example.txt", "w", encoding="utf-8") as file_out:
        file_out.write(example_txt)
    

    This creates the file shown in the attached screenshot.

    Load in the generated data.

    from flair.data import Corpus, Sentence
    from flair.datasets import ColumnCorpus
    columns: dict = {0: "text", 1: "ner"}
    corpus: Corpus = ColumnCorpus(data_folder="data", column_format=columns, train_file="example.txt")
    
    sentence: Sentence = corpus.train[0]
    for span in sentence.get_spans("ner"):
        print(span)
    
    >>>Span[0:2]: "George Washington" → NAME (1.0)
    >>>Span[5:6]: "and" → CITY (1.0)
    

    "And" incorrectly received the CITY span, coffee is not listed as an entity. I assume this is due to the \t being matched as a column seperator.

    Expected behavior: Using the exact same code snippet in Flair 0.10 or lower, I get the following (which is also what I expect).

    for span in sentence.get_spans("ner"):
        print(span)
    
    >>>[<NAME-span (1,2): "George Washington">,
    >>> <CITY-span (5): "Washington">,
    >>> <BEVERAGE-span (9): "Coffee">]
    

    Environment (please complete the following information):

    • Flair >= 0.11:

    How I managed to load my data correctly: this issue is fixable by specifying the column delimiter.

    from flair.data import Corpus, Sentence
    from flair.datasets import ColumnCorpus
    columns: dict = {0: "text", 1: "ner"}
    corpus: Corpus = ColumnCorpus(data_folder="data", column_format=columns, train_file="example.txt", column_delimiter = " ")
    
    sentence: Sentence = corpus.train[0]
    for span in sentence.get_spans("ner"):
        print(span)
    
    >>>Span[0:2]: "George Washington" → NAME (1.0)
    >>>Span[4:5]: "Washington" → CITY (1.0)
    >>>Span[8:9]: "coffee" → BEVERAGE (1.0)
    

    Why am I listing it as a bug if it's fixable? I spent over a week debugging this issue. The code runs, the model "seems" to learn, the loss decreases but F1 stays at 0, and the data seems correct. It is a very difficult, almost "invisible" issue to spot, and I believe others will overlook it as well. Moreover, Flair <= 0.10 gracefully resolves this issue (I assume there is a failsafe check somewhere).

    If you have any questions, please shoot and I will get back to you quickly.

    Kind regards, Guust
