Stanza: A Python NLP Library for Many Human Languages

The Stanford NLP Group's official Python NLP library. It contains support for running various accurate natural language processing tools on 60+ languages and for accessing the Java Stanford CoreNLP software from Python. For detailed information please visit our official website.

🔥  A new collection of biomedical and clinical English model packages is now available, offering a seamless experience for syntactic analysis and named entity recognition (NER) from biomedical literature text and clinical notes. For more information, check out our Biomedical models documentation page.

References

If you use this library in your research, please cite our ACL 2020 Stanza system demo paper:

@inproceedings{qi2020stanza,
    title={Stanza: A {Python} Natural Language Processing Toolkit for Many Human Languages},
    author={Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.},
    booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
    year={2020}
}

If you use our biomedical and clinical models, please also cite our Stanza Biomedical Models description paper:

@article{zhang2020biomedical,
  title={Biomedical and Clinical English Model Packages in the Stanza Python NLP Library},
  author={Zhang, Yuhao and Zhang, Yuhui and Qi, Peng and Manning, Christopher D. and Langlotz, Curtis P.},
  journal={arXiv preprint arXiv:2007.14640},
  year={2020}
}

The PyTorch implementation of the neural pipeline in this repository is due to Peng Qi, Yuhao Zhang, and Yuhui Zhang, with help from Jason Bolton, Tim Dozat and John Bauer. Maintenance of this repo is currently led by John Bauer.

If you use the CoreNLP software through Stanza, please cite the CoreNLP software package and the respective modules as described here ("Citing Stanford CoreNLP in papers"). The CoreNLP client is mostly written by Arun Chaganty, and Jason Bolton spearheaded merging the two projects together.

Issues and Usage Q&A

To ask questions, report issues, or request features 🤔, please use the GitHub Issue Tracker. Before creating a new issue, please search for existing issues that may already address your problem, or visit the Frequently Asked Questions (FAQ) page on our website.

Contributing to Stanza

We welcome community contributions to Stanza in the form of bugfixes 🛠️ and enhancements 💡 ! If you want to contribute, please first read our contribution guidelines.

Installation

pip

Stanza supports Python 3.6 or later. We recommend that you install Stanza via pip, the Python package manager. To install, simply run:

pip install stanza

This will also resolve all of Stanza's dependencies, for instance PyTorch 1.3.0 or above.

If you currently have a previous version of stanza installed, use:

pip install stanza -U

Anaconda

To install Stanza via Anaconda, use the following conda command:

conda install -c stanfordnlp stanza

Note that installing Stanza via Anaconda does not currently work for Python 3.8. For Python 3.8, please use the pip installation.

From Source

Alternatively, you can install from the source of this git repository, which will give you more flexibility in developing on top of Stanza. For this option, run

git clone https://github.com/stanfordnlp/stanza.git
cd stanza
pip install -e .

Running Stanza

Getting Started with the neural pipeline

To run your first Stanza pipeline, simply follow these steps in your Python interactive interpreter:

>>> import stanza
>>> stanza.download('en')       # This downloads the English models for the neural pipeline
>>> nlp = stanza.Pipeline('en') # This sets up a default neural pipeline in English
>>> doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
>>> doc.sentences[0].print_dependencies()

The last command prints the words in the first sentence of the input string (or Document, as it is represented in Stanza), together with, for each word, the index of the word that governs it in the Universal Dependencies parse of that sentence (its "head") and the dependency relation between the two words. The output should look like:

('Barack', '4', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '4', 'aux:pass')
('born', '0', 'root')
('in', '6', 'case')
('Hawaii', '4', 'obl')
('.', '4', 'punct')

See our getting started guide for more details.

Accessing Java Stanford CoreNLP software

Aside from the neural pipeline, this package also includes an official wrapper for accessing the Java Stanford CoreNLP software with Python code.

There are a few initial setup steps.

  • Download Stanford CoreNLP and models for the language you wish to use
  • Put the model jars in the distribution folder
  • Tell the Python code where Stanford CoreNLP is located by setting the CORENLP_HOME environment variable (e.g., in *nix): export CORENLP_HOME=/path/to/stanford-corenlp-4.1.0
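The CORENLP_HOME variable can also be set from inside Python before the client is started; a minimal sketch, where the path below is a placeholder for your actual CoreNLP directory:

```python
import os

# Placeholder path: point this at your actual CoreNLP installation.
os.environ["CORENLP_HOME"] = "/path/to/stanford-corenlp-4.1.0"

# The CoreNLP client reads this variable when launching the Java server.
print(os.environ["CORENLP_HOME"])
```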

We provide comprehensive examples in our documentation that show how one can use CoreNLP through Stanza and extract various annotations from it.

Online Colab Notebooks

To get you started, we also provide interactive Jupyter notebooks in the demo folder. You can also open these notebooks and run them interactively on Google Colab. To view all available notebooks, follow these steps:

  • Go to the Google Colab website
  • Navigate to File -> Open notebook, and choose GitHub in the pop-up menu
  • Note that you do not need to give Colab access permission to your github account
  • Type stanfordnlp/stanza in the search bar, and press Enter

Trained Models for the Neural Pipeline

We currently provide models for all of the Universal Dependencies v2.5 treebanks, as well as NER models for a few widely spoken languages. You can find instructions for downloading and using these models here.

Batching To Maximize Pipeline Speed

To maximize speed, it is essential to run the pipeline on batches of documents; running a for loop over one sentence at a time will be very slow. The best approach at this time is to concatenate documents together, separating each document with a blank line (i.e., two line breaks \n\n). The tokenizer will recognize blank lines as sentence breaks. We are actively working on improving multi-document processing.
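As a minimal sketch of this batching pattern (the pipeline call is commented out since it requires downloaded models; only the concatenation step is shown running):

```python
# Three short documents to be processed in one pipeline call.
documents = [
    "Barack Obama was born in Hawaii.",
    "He was elected president in 2008.",
    "Stanford is located in California.",
]

# Join the documents with a blank line (two line breaks) so the
# tokenizer treats each boundary as a sentence break.
batched_text = "\n\n".join(documents)

# nlp = stanza.Pipeline('en')   # assumes English models are downloaded
# doc = nlp(batched_text)       # one pipeline call instead of a Python loop
print(batched_text.count("\n\n"))  # 2 separators for 3 documents
```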

Training your own neural pipelines

All neural modules in this library can be trained with your own data. The tokenizer, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer, and the dependency parser require CoNLL-U formatted data, while the NER model requires data in the BIOES format. Currently, we do not support model training via the Pipeline interface; to train your own models, you need to clone this git repository and run training from source.
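For illustration, a sentence in the BIOES scheme looks like the following sketch (the token/tag pairs are hypothetical examples, not taken from a real training file):

```python
# BIOES tags: B = begin of a multi-token entity, I = inside, E = end,
# S = single-token entity, O = outside any entity. The training file
# holds one token-tag pair per line, with a blank line between sentences.
tagged = [
    ("Barack", "B-PERSON"),
    ("Obama", "E-PERSON"),
    ("visited", "O"),
    ("Hawaii", "S-GPE"),
    (".", "O"),
]

lines = "\n".join(f"{token}\t{tag}" for token, tag in tagged)
print(lines)
```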

For detailed step-by-step guidance on how to train and evaluate your own models, please visit our training documentation.

LICENSE

Stanza is released under the Apache License, Version 2.0. See the LICENSE file for more details.

Comments
  • google.protobuf.message.DecodeError: Error parsing message

    Description I think this is similar to a bug in the old python library, python-stanford-corenlp. I'm trying to copy the demo for the client here or here, but with my own texts. text2 works and text3 doesn't; the only difference between them is the very last word.

    The error I get is:

    Traceback (most recent call last):
      File "C:/gitProjects/patentmoto2/scratch4.py", line 23, in <module>
        ann = client.annotate(text)
      File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\server\client.py", line 403, in annotate
        parseFromDelimitedString(doc, r.content)
      File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\protobuf\__init__.py", line 18, in parseFromDelimitedString
        obj.ParseFromString(buf[offset+pos:offset+pos+size])
    google.protobuf.message.DecodeError: Error parsing message
    

    To Reproduce

    Steps to reproduce the behavior:

    
    print('---')
    print('input text')
    print('')
    
    text = "Chris Manning is a nice person. Chris wrote a simple sentence. He also gives oranges to people."
    text2 = "We claim:1. A photographic camera for three dimension photography comprising:a housing having an opening to the interior for light rays;means for immovably locating photosensitive material in communication with the interior of the housing at a location during a time for exposure;optical means in said housing for projecting light rays, which are received through said opening from a scene to be photographed, along an optical path to said location, said path having a first position therealong extending transversely to the direction of the path from a first side to a second side of the path, the optical means comprisinga lenticular screen extending across said path at a second position farther along said path from the first position and having, on one side, a plurality of elongated lenticular elements of width P which face in the direction from which the light rays are being projected and having an opposite side facing and positioned for contact with the surface of such located photosensitive material,the optical means being characterized in that it changes, by a predetermined distance Y, on such surface of the photosensitive material, the position of light rays which come from a substantially common point on such scene and which extend along said first and second sides of said path;means for blocking the received light rays at said first position;an aperture movable transversely across said path at said first position, from said first side to said second said, for exposing said light rays sequentially to the photosensitive material moving across said screen in a direction normal to the elongation of said lenticular elements; andmeans for so moving said aperture for a predetermined time for exposure while simultaneously and synchronously moving said screen, substantially throughout said predetermined time for exposure, in substantially the same direction as the light rays sequentially expose said photosensitive material and over a distance substantially 
equal to the sum of P + Y to thereby expose a substantially continuous unreversed image of the scene on the photosensitive material, said means for and doing this all day long and."
    text3 = "We claim:1. A photographic camera for three dimension photography comprising:a housing having an opening to the interior for light rays;means for immovably locating photosensitive material in communication with the interior of the housing at a location during a time for exposure;optical means in said housing for projecting light rays, which are received through said opening from a scene to be photographed, along an optical path to said location, said path having a first position therealong extending transversely to the direction of the path from a first side to a second side of the path, the optical means comprisinga lenticular screen extending across said path at a second position farther along said path from the first position and having, on one side, a plurality of elongated lenticular elements of width P which face in the direction from which the light rays are being projected and having an opposite side facing and positioned for contact with the surface of such located photosensitive material,the optical means being characterized in that it changes, by a predetermined distance Y, on such surface of the photosensitive material, the position of light rays which come from a substantially common point on such scene and which extend along said first and second sides of said path;means for blocking the received light rays at said first position;an aperture movable transversely across said path at said first position, from said first side to said second said, for exposing said light rays sequentially to the photosensitive material moving across said screen in a direction normal to the elongation of said lenticular elements; andmeans for so moving said aperture for a predetermined time for exposure while simultaneously and synchronously moving said screen, substantially throughout said predetermined time for exposure, in substantially the same direction as the light rays sequentially expose said photosensitive material and over a distance substantially 
equal to the sum of P + Y to thereby expose a substantially continuous unreversed image of the scene on the photosensitive material, said means for and doing this all day long and his."
    
    text = text3
    print(text)
    
    
    print('---')
    print('starting up Java Stanford CoreNLP Server...')
    
    
    with CoreNLPClient(endpoint="http://localhost:9000", annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
                       timeout=70000, memory='16G', threads=10, be_quiet=False) as client:
    
        ann = client.annotate(text)
    
    
        sentence = ann.sentence[0]
    
    
        print('---')
        print('constituency parse of first sentence')
        constituency_parse = sentence.parseTree
        print(constituency_parse)
    

    Expected behavior I expect it to finish. text=text2 succeeds, but text=text3 fails with the above error. The only difference between the texts is the last word 'his' (it could really be anything, I think).

    Environment:

    • OS: Windows 10
    • Python version: 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
    • CoreNLP 3.9.2
    • corenlp-protobuf==3.8.0
    • protobuf==3.10.0
    • stanfordnlp==0.2.0
    • torch==1.1.0

    Additional context I've also gotten a timeout error for some sentences, but it's intermittent. I'm not sure if they're related, but this one is easier to reproduce.

  • FileNotFoundError: Could not find any treebank files which matched extern_data/ud2/ud-treebanks-v2.8/UD_English-TEST/*-ud-train.conllu

    Hi, a couple of questions that are related.

    I'm trying to train a new model for a new language, but I'm first trying the data included in the package to learn more about how Stanza trains models.

    When I run the command

    python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-TEST

    the following error appears:

    (nlp) [email protected] oe_lemmatizer_stanza % python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-TEST
    2022-06-27 16:45:52 INFO: Datasets program called with:
    /Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_tokenizer_treebank.py UD_English-TEST
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1136, in <module>
        main()
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1133, in main
        common.main(process_treebank, add_specific_args)
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/common.py", line 134, in main
        process_treebank(treebank, paths, args)
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1116, in process_treebank
        train_conllu_file = common.find_treebank_dataset_file(treebank, udbase_dir, "train", "conllu", fail=True)
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/common.py", line 37, in find_treebank_dataset_file
        raise FileNotFoundError("Could not find any treebank files which matched {}".format(filename))
    FileNotFoundError: Could not find any treebank files which matched extern_data/ud2/ud-treebanks-v2.8/UD_English-TEST/*-ud-train.conllu

    The path I am using is the exact one that comes with the package when cloning it from GitHub. My idea is to replace these files with my own. I have looked through closed issues about similar errors, but their solutions are not applicable to my problem.

    Also, I'm following the documentation for this in https://stanfordnlp.github.io/stanza/training.html#converting-ud-data, but no info is given about the train, test, and dev data. Is the script going to generate the dev and test ones? Do I need to generate them? I'm new to this, and the language I'm trying to add is not in the Universal Dependencies, I have found some datasets in .conll format, which I have converted to .conllu following Stanza documentation.

    Any ideas?

    Thanks!

  • "AnnotationException: Could not handle incoming annotation" Problem [QUESTION]

    Greetings,

    I am new to the CoreNLP environment and am trying to run the example code given in the documentation. However, I got two errors, as follows:

    First code:

    from stanza.server import CoreNLPClient

    with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'ner'],
                       timeout=30000, memory='2G', be_quiet=True) as client:
        anno = client.annotate(text)

    2020-12-30 16:40:53 INFO: Writing properties to tmp file: corenlp_server-a15136448b834f79.props 2020-12-30 16:40:53 INFO: Starting server with command: java -Xmx2G -cp C:\Users\fatih\stanza_corenlp* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 30000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-a15136448b834f79.props -annotators tokenize,ssplit,pos,ner -preload -outputFormat serialized

    Traceback (most recent call last):
    
      File "C:\Users\fatih\anaconda3\lib\site-packages\stanza\server\client.py", line 446, in _request
        r.raise_for_status()
      File "C:\Users\fatih\anaconda3\lib\site-packages\requests\models.py", line 941, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:9000/?properties=%7B%27annotators%27%3A+%27tokenize%2Cssplit%2Cpos%2Cner%27%2C+%27outputFormat%27%3A+%27serialized%27%7D&resetDefault=false
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "<ipython-input-6-2fbdcdb77b41>", line 6, in <module>
        anno = client.annotate(text)
      File "C:\Users\fatih\anaconda3\lib\site-packages\stanza\server\client.py", line 514, in annotate
        r = self._request(text.encode('utf-8'), request_properties, reset_default, **kwargs)
      File "C:\Users\fatih\anaconda3\lib\site-packages\stanza\server\client.py", line 452, in _request
        raise AnnotationException(r.text)
    AnnotationException: Could not handle incoming annotation
    

    What am I doing wrong? It's on Windows, Anaconda, Spyder.

  • How can i run multiple stanza NER models parallel to eachother?

    I want to run multiple Stanza NER models in parallel. How can I do so? I tried using torch multiprocessing, creating multiple processes with each process running one model, but it doesn't seem to go well.

    processes = []
    for i in range(4):  # No. of processes
        p = mp.Process(target=test, args=(model,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

  • Dependency parsing in StanfordCoreNLP  and Stanza giving different result

    I did dependency parsing using StanfordCoreNLP using the code below

    from stanfordcorenlp import StanfordCoreNLP
    nlp = StanfordCoreNLP('stanford-corenlp-full-2018-10-05', lang='en')
    
    sentence = 'The clothes in the dressing room are gorgeous. Can I have one?'
    tree_str = nlp.parse(sentence)
    print(tree_str)
    

    And I got the output:

      (S
        (NP
          (NP (DT The) (NNS clothes))
          (PP (IN in)
            (NP (DT the) (VBG dressing) (NN room))))
        (VP (VBP are)
          (ADJP (JJ gorgeous)))
        (. .)))
    

    How can I get this same output in Stanza?

    import stanza
    from stanza.server import CoreNLPClient
    classpath='/stanford-corenlp-full-2020-04-20/*'
    client = CoreNLPClient(be_quite=False, classpath=classpath, annotators=['parse'], memory='4G', endpoint='http://localhost:8900')
    client.start()
    text = 'The clothes in the dressing room are gorgeous. Can I have one?'
    ann = client.annotate(text)
    sentence = ann.sentence[0]
    dependency_parse = sentence.basicDependencies
    print(dependency_parse)
    
    

    In Stanza, it appears I have to split the text into the sentences that make it up. Is there something I am doing wrong?

    Please note that my objective is to extract noun phrases.

  • PermanentlyFailedException: Timed out waiting for service to come alive. Part3

    Hi! I know this is similar to #52 and #91 but I am unable to understand how that was solved.

    When I run it on the command line (Ubuntu 16.04.6 LTS), it runs successfully, as shown below:

    java -Xmx16G -cp "/home/naive/Documents/shrikant/Dialogue_Implement/DST/stanford-corenlp-full-2018-10-05/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-34d0c1fe4d724a56.props -preload tokenize,ssplit,pos,lemma,ner
    
    [main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
    [main] INFO CoreNLP - setting default constituency parser
    [main] INFO CoreNLP - using SR parser: edu/stanford/nlp/models/srparser/englishSR.ser.gz
    [main] INFO CoreNLP -     Threads: 5
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
    [main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.6 sec].
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
    [main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.2 sec].
    [main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.5 sec].
    [main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.7 sec].
    [main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
    [main] INFO edu.stanford.nlp.time.TimeExpressionExtractorImpl - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
    [main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 580704 unique entries out of 581863 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns.
    [main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns.
    [main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 585573 unique entries from 2 files
    [main] INFO CoreNLP - Starting server...
    [main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
    
    

    But when I run it with a Python script, it fails with the error below:

    
    import os
    os.environ["CORENLP_HOME"] = '/home/naive/Documents/shrikant/Dialogue_Implement/DST/stanford-corenlp-full-2018-10-05'
    
    # Import client module
    from stanza.server import CoreNLPClient
    
    
    client = CoreNLPClient(be_quite=False, classpath='"/home/naive/Documents/shrikant/Dialogue_Implement/DST/stanford-corenlp-full-2018-10-05/*"', annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], memory='16G', endpoint='http://localhost:9000')
    print(client)
    
    client.start()
    #import time; time.sleep(10)
    
    text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
    print (text)
    document = client.annotate(text)
    print ('malviya')
    print(type(document))
    

    Error:

    <stanza.server.client.CoreNLPClient object at 0x7fd296e40d68>
    Starting server with command: java -Xmx4G -cp "/home/naive/Documents/shrikant/Dialogue_Implement/DST/stanford-corenlp-full-2018-10-05"/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-9a4ccb63339146d0.props -preload tokenize,ssplit,pos,lemma,ner
    Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity.
    
    Traceback (most recent call last):
      File "stanza_eng.py", line 18, in <module>
        document = client.annotate(text)
      File "/home/naive/.conda/envs/torch_gpu36/lib/python3.6/site-packages/stanza/server/client.py", line 431, in annotate
        r = self._request(text.encode('utf-8'), request_properties, **kwargs)
      File "/home/naive/.conda/envs/torch_gpu36/lib/python3.6/site-packages/stanza/server/client.py", line 342, in _request
        self.ensure_alive()
      File "/home/naive/.conda/envs/torch_gpu36/lib/python3.6/site-packages/stanza/server/client.py", line 161, in ensure_alive
        raise PermanentlyFailedException("Timed out waiting for service to come alive.")
    stanza.server.client.PermanentlyFailedException: Timed out waiting for service to come alive.
    
    

    Python 3.6.10 asn1crypto==1.3.0 certifi==2020.4.5.1 cffi==1.14.0 chardet==3.0.4 cryptography==2.8 embeddings==0.0.8 gast==0.2.2 idna==2.9 numpy==1.18.2 protobuf==3.11.3 pycparser==2.20 pyOpenSSL==19.1.0 PySocks==1.7.1 requests==2.23.0 six==1.14.0 stanza==1.0.0 torch==1.4.0 tqdm==4.44.1 urllib3==1.25.8 vocab==0.0.4

    I am unable to understand the issue here...

  • Users from China suffer from connection issue when downloading Stanza models

    Hi, there

    Could you help me trace this issue? Here is some of my info:

    • Network is okay without limitations
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import stanza
    
    if __name__ == '__main__':
        # https://github.com/stanfordnlp/stanza/blob/master/demo/Stanza_Beginners_Guide.ipynb
        # Note that you can use verbose=False to turn off all printed messages
        print("Downloading Chinese model...")
        stanza.download('zh', verbose=True)
    
        # Build a Chinese pipeline, with customized processor list and no logging, and force it to use CPU
        print("Building a Chinese pipeline...")
        zh_nlp = stanza.Pipeline('zh', processors='tokenize,lemma,pos,depparse', verbose=True, use_gpu=False)
    
    C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\Scripts\python.exe C:/Users/mystic/JetBrains/PycharmProjects/BuildRoleRelationship4Novel/learn_stanza.py
    Downloading Chinese model...
    Traceback (most recent call last):
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
        conn = connection.create_connection(
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
        raise err
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
        sock.connect(sa)
    TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
        httplib_response = self._make_request(
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
        self._validate_conn(conn)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connectionpool.py", line 976, in _validate_conn
        conn.connect()
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connection.py", line 308, in connect
        conn = self._new_conn()
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
        raise NewConnectionError(
    urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000001E5A5DE7220>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\adapters.py", line 439, in send
        resp = conn.urlopen(
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connectionpool.py", line 724, in urlopen
        retries = retries.increment(
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\util\retry.py", line 439, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /stanfordnlp/stanza-resources/master/resources_1.0.0.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001E5A5DE7220>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:/Users/mystic/JetBrains/PycharmProjects/BuildRoleRelationship4Novel/learn_stanza.py", line 9, in <module>
        stanza.download('zh', verbose=True)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\stanza\utils\resources.py", line 223, in download
        request_file(f'{DEFAULT_RESOURCES_URL}/resources_{__resources_version__}.json', os.path.join(dir, 'resources.json'))
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\stanza\utils\resources.py", line 83, in request_file
        download_file(url, path)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\stanza\utils\resources.py", line 66, in download_file
        r = requests.get(url, stream=True)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\api.py", line 76, in get
        return request('get', url, params=params, **kwargs)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\api.py", line 61, in request
        return session.request(method=method, url=url, **kwargs)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\sessions.py", line 530, in request
        resp = self.send(prep, **send_kwargs)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\sessions.py", line 643, in send
        r = adapter.send(request, **kwargs)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\adapters.py", line 516, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /stanfordnlp/stanza-resources/master/resources_1.0.0.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001E5A5DE7220>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
    
    Process finished with exit code 1
    
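    The traceback above is a plain connectivity failure: stanza.download could not reach raw.githubusercontent.com to fetch the resources file. If the machine sits behind a proxy, setting the standard HTTPS_PROXY environment variable (which requests honors) is usually enough; on a flaky link, retrying with backoff can help. A minimal retry sketch, with an illustrative helper name that is not part of Stanza:

```python
import time

def fetch_with_retries(fetch, url, attempts=3, backoff=0.5):
    """Call fetch(url), retrying on connection errors with exponential backoff."""
    for i in range(attempts):
        try:
            return fetch(url)
        except OSError:  # requests' ConnectionError subclasses IOError/OSError
            if i == attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(backoff * (2 ** i))

# Usage sketch (needs network access):
# import requests
# r = fetch_with_retries(requests.get, "https://raw.githubusercontent.com/"
#                        "stanfordnlp/stanza-resources/master/resources_1.0.0.json")
```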
    
  • [QUESTION] Can I run Stanza inside Docker container?

    Can I run Stanza inside a Docker container? I created a container and installed all the dependencies, but when the interpreter reaches the call [word.lemma for sent in doc_stanza.sentences for word in sent.words], the program just freezes without errors.

  • MWT and Pretokenized Text for Italian

    Hello! I'm using Stanza for Italian, and I'm trying to generate a pred file starting from a gold file. Unfortunately, if I start with pretokenized text, the new pipeline doesn't read MWT tokens, so I can't keep the files aligned. I saw a similar question (#95), but I don't think the problem has been solved... Can anyone help me?
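    For reference, pretokenized input is passed to the pipeline as one list of surface tokens per sentence with tokenize_pretokenized=True; in that mode each token is taken as a word as-is, which is exactly why multi-word tokens are not expanded. A sketch (the Italian tokens are illustrative; running the pipeline itself requires the downloaded models, so it is left commented out):

```python
# Pretokenized input: one inner list of surface tokens per sentence.
pretokenized = [["Vado", "al", "mare"], ["Torno", "domani"]]

# Hypothetical pipeline call (loads the Italian models):
# import stanza
# nlp = stanza.Pipeline("it", processors="tokenize,pos",
#                       tokenize_pretokenized=True)
# doc = nlp(pretokenized)
```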

  • ValueError: substring not found

    Describe the bug When using the Vietnamese POS tagger, this problem occurs. To Reproduce Steps to reproduce the behavior:

    1. read the sentence s;
    2. call nlp(s);
    3. 'ValueError: substring not found' is raised.

    Environment (please complete the following information):

    • OS: CentOS
    • Python version: Python 3.6.8
    • Stanza version: 1.1.1

    Additional context
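    A generic way to narrow down a crash like this is to run the pipeline one sentence at a time and record which inputs raise. The helper below is a plain debugging sketch, not part of Stanza; nlp stands for any callable pipeline:

```python
def find_failing_sentences(nlp, sentences):
    """Run nlp on each sentence, collecting the ones that raise ValueError."""
    failures = []
    for s in sentences:
        try:
            nlp(s)
        except ValueError as e:
            failures.append((s, str(e)))  # keep the input and the message
    return failures
```

Running this over the corpus isolates the exact sentence that triggers 'substring not found', which is usually what the maintainers need to reproduce the bug.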

  • Is there an API to update existing NER models?

    I have found documentation on training NER models from scratch, but is there an API that would allow one to update an existing model locally, adding fresh text and annotations or fresh labels onto, say, i2b2 or radiology?

  • Mismatched token output using custom stanza tokenizer

    Describe the bug I trained a custom Stanza tokenizer and MWT expander on UD_English-GUM. When using the tokenizer & MWT for inference, the tokenizer changed the surface form of a word. For example, the word "subcontractor's" is tokenized as "subcontratrr 's" in the sentence:

    "The college is a state-funded uh uh remodel, and on state-funded remodels, we're required to pay prevailing wages. Uh prevailing wages, that, um, that indicate different levels of agility, of the different men working. And so, uh a lot of the crews, uh, like Mitchell, who have people that work under him, around town, in regular situations, come to the people like me, and ask us to do payroll for them. When we do the payroll for them, we state to them up front, that uh, we will pay the payroll, we will make the deductions, and then the employer contribution, which is approximately twenty-six percent, over and above the hourly wage, is also deducted, from the um subcontractor's check."

    To Reproduce Steps to reproduce the behavior:

    1. Train the tokenizer on UD_English-GUM
    2. Use the saved en_gum_tokenizer.pt model on other plain text

    Expected behavior subcontractor's -> subcontractor 's

    Environment (please complete the following information):

    • OS: CentOS 7
    • Python version: Python 3.7.11 from Anaconda
    • Stanza version: 1.3.0

    Additional context I have also tried the newest Stanza version, 1.4.2, and the issue is still there.
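    A regression like this can be caught automatically by checking each token's text against the character span it covers in the input (Stanza tokens expose start_char/end_char offsets). The checker below is a generic sketch over (token, (start, end)) pairs rather than Stanza objects:

```python
def surface_mismatches(text, tokens_with_spans):
    """Return (token, actual_span_text) pairs where the surface form changed."""
    bad = []
    for tok, (start, end) in tokens_with_spans:
        if text[start:end] != tok:
            bad.append((tok, text[start:end]))
    return bad
```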

  • How to replicate results of stanza constituency parser on Penn Treebank data

    Hi, I'm trying to reproduce the results mentioned here for the constituency parser on Penn Treebank data. I have access to the WSJ data, and I downloaded the wsj_bert.pt model with the following command:

    stanza.Pipeline(lang='en', processors='tokenize,pos,constituency', package={'constituency': 'wsj_bert'})

    The model is successfully downloaded and it is saved here: ~/stanza_resources/en/constituency

    Now, I want to measure this model's performance on the WSJ test data, so I ran the command below. (I renamed test.trees to en_wsj_bert_test.mrg to keep the model name and the data name consistent.)

    python -m stanza.utils.training.run_constituency en_wsj_bert --save_dir ~/stanza_resources/en/constituency --score_test

    This returns an awfully low score of around 0.838210. I don't know where I went wrong, but I would like to fix it. I'm going to use this as a baseline, so I need to replicate the scores exactly as mentioned here.

    Thanks for your help

  • Inaccurate Dependency Tagging for Subordinates (ccomp)

    Greetings all,

    I'm working on extracting subordinate clauses via Stanza (specifically through spacy-stanza); however, dependency parsing seems to produce inaccurate results.

    Following the guide from https://universaldependencies.org here, clausal subjects are tagged as csubj. For instance, the expected result should be as follows:

    import stanza
    import spacy_stanza

    nlp = spacy_stanza.load_pipeline("en")
    sentence = 'what she said makes sense'
    doc = nlp(sentence)

    for t in doc:
        print(t.text, t.dep_, t.head.text)

    what dobj said
    she nsubj said
    said csubj was
    was ROOT was
    well advmod received
    received acomp was
    

    However, these are the results I get:

    What obj makes
    she nsubj said
    said acl:relcl What
    makes root makes
    sense obj makes
    . punct makes
    

    Stanza tags the item 'said' as a relative clause. As explained in this paper, the authors also used Stanza, yet I am not sure whether it was a pretrained model or not. Why the inconsistency? I've also tried other packages such as 'ewt' and got similar results. I'm having much the same issue with spaCy models as well. Training a model from scratch would be beyond my knowledge. How should I proceed?
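    When triaging parses like the two above, it helps to filter the (word, deprel, head) triples down to the clausal relations in question. A small sketch over plain tuples, independent of the spaCy/Stanza objects:

```python
def clausal_edges(triples, rels=("csubj", "ccomp", "xcomp", "advcl", "acl:relcl")):
    """Keep only the dependency edges whose relation marks a clause."""
    return [t for t in triples if t[1] in rels]
```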

  • More combined models?

    I saw the great idea for combined models here:

    https://stanfordnlp.github.io/stanza/combined_models.html

    Is there a process to request more of these? Specifically I was thinking of Hebrew right now.
