InferSent sentence embeddings

InferSent

InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks.

We provide our pre-trained English sentence encoder from our paper and our SentEval evaluation toolkit.

Recent changes: Removed train_nli.py and kept only the pretrained models for simplicity. I no longer have time to maintain the repo beyond simple scripts to get sentence embeddings.

Dependencies

This code is written in Python. Dependencies include:

  • Python 2/3
  • PyTorch (recent version)
  • NLTK >= 3

Download word vectors

Download GloVe (V1) or fastText (V2) vectors:

mkdir GloVe
curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip GloVe/glove.840B.300d.zip -d GloVe/
mkdir fastText
curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
unzip fastText/crawl-300d-2M.vec.zip -d fastText/

Use our sentence encoder

We provide a simple interface to encode English sentences. See demo.ipynb for a practical example. Get started with the following steps:

0.0) Download our InferSent models (V1 trained with GloVe, V2 trained with fastText)[147MB]:

mkdir encoder
curl -Lo encoder/infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl
curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

Note that infersent1 is trained with GloVe vectors (which were trained on text preprocessed with the PTB tokenizer) and infersent2 is trained with fastText vectors (which were trained on text preprocessed with the MOSES tokenizer). The latter also removes the zero-padding in max-pooling, which was inconvenient when embedding sentences outside of their batches.

0.1) Make sure you have the NLTK tokenizer by running the following once:

import nltk
nltk.download('punkt')

1) Load our pre-trained model (in encoder/):

import torch
from models import InferSent

V = 2  # model version: 1 (GloVe) or 2 (fastText)
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

2) Set word vector path for the model:

W2V_PATH = 'fastText/crawl-300d-2M.vec'  # for V = 1, use 'GloVe/glove.840B.300d.txt'
infersent.set_w2v_path(W2V_PATH)

3) Build the vocabulary of word vectors (i.e., keep only those needed):

infersent.build_vocab(sentences, tokenize=True)

where sentences is your list of n sentences. You can update your vocabulary using infersent.update_vocab(sentences), or directly load the K most common English words with infersent.build_vocab_k_words(K=100000). If tokenize is True (the default), sentences will be tokenized using NLTK.

4) Encode your sentences (list of n sentences):

embeddings = infersent.encode(sentences, tokenize=True)

This outputs a numpy array with n vectors of dimension 4096. Speed is around 1000 sentences per second with batch size 128 on a single GPU.

5) Visualize the importance that our model attributes to each word:

We provide a function to visualize the importance of each word in the encoding of a sentence:

infersent.visualize('A man plays an instrument.', tokenize=True)

Evaluate the encoder on transfer tasks

To evaluate the model on transfer tasks, see SentEval. Be sure to use the same tokenization that was used for training the encoder. You should obtain the following test results for the baselines and the InferSent models:

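| Model | MR | CR | SUBJ | MPQA | STS14 | STS Benchmark | SICK Relatedness | SICK Entailment | SST | TREC | MRPC |
|--------------|------|------|------|------|---------|-----------|-------|------|------|------|-----------|
| InferSent1 | 81.1 | 86.3 | 92.4 | 90.2 | .68/.65 | 75.8/75.5 | 0.884 | 86.1 | 84.6 | 88.2 | 76.2/83.1 |
| InferSent2 | 79.7 | 84.2 | 92.7 | 89.4 | .68/.66 | 78.4/78.4 | 0.888 | 86.3 | 84.3 | 90.8 | 76.0/83.8 |
| SkipThought | 79.4 | 83.1 | 93.7 | 89.3 | .44/.45 | 72.1/70.2 | 0.858 | 79.5 | 82.9 | 88.4 | - |
| fastText-BoV | 78.2 | 80.2 | 91.8 | 88.0 | .65/.63 | 70.2/68.3 | 0.823 | 78.9 | 82.3 | 83.4 | 74.4/82.4 |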

Reference

Please consider citing [1] if you found this code useful.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data (EMNLP 2017)

[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

@InProceedings{conneau-EtAl:2017:EMNLP2017,
  author    = {Conneau, Alexis  and  Kiela, Douwe  and  Schwenk, Holger  and  Barrault, Lo\"{i}c  and  Bordes, Antoine},
  title     = {Supervised Learning of Universal Sentence Representations from Natural Language Inference Data},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {670--680},
  url       = {https://www.aclweb.org/anthology/D17-1070}
}

Comments
  • ValueError: some of the strides of a given numpy array are negative. This is currently not supported,

    Trying to use the pretrained InferSent2 model (i.e., fastText) to encode sentences. The pretrained model works perfectly on CPU (where I have torch 0.4.1). However, it crashes with the CUDA backend (where I have torch 1.0.0.dev20181017).

     File ".../InferSent/models.py", line 224, in encode
        batch = self.forward((batch, lengths[stidx:stidx + bsize])).data.cpu().numpy()
      File ".../InferSent/models.py", line 66, in forward
        sent_packed = nn.utils.rnn.pack_padded_sequence(sent, sent_len_sorted)
      File ".../python3.7/site-packages/torch/nn/utils/rnn.py", line 147, in pack_padded_sequence
        lengths = torch.as_tensor(lengths, dtype=torch.int64)
    ValueError: some of the strides of a given numpy array are negative. This is currently not supported, but will be added in future releases.
    
     $ python --version
    Python 3.7.0
    $ pip list | grep torch
    torch           1.0.0.dev20181017
    $ nvidia-smi
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
    |-------------------------------+----------------------+----------------------+
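The error above is raised when `pack_padded_sequence` receives a lengths array with negative strides. One common way such an array arises is reversing a sorted NumPy array with `[::-1]`, which returns a view rather than a copy. A minimal sketch, assuming that is the cause here; the variable names and values are illustrative, not taken from models.py:

```python
import numpy as np

# Hypothetical sentence lengths for a batch of four sentences.
sent_len = np.array([5, 12, 7, 3])

# Reversing a sorted array with [::-1] yields a view with a negative stride.
sorted_view = np.sort(sent_len)[::-1]
print(sorted_view.strides)

# Copying lays the data out in forward order, so the stride becomes positive.
sorted_copy = np.ascontiguousarray(sorted_view)
print(sorted_copy.strides)
```

Passing the copied array instead of the reversed view is one possible workaround; whether it is the right fix for a given PyTorch version is an assumption to verify.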
    
  • RuntimeError: tried to construct a tensor from a int sequence, but found an item of type numpy.int64 at index (0)

    I tried to run the train_nli.py file to train your model but got the following error.

    Traceback (most recent call last):
      File "train_nli.py", line 283, in <module>
        train_acc = trainepoch(epoch)
      File "train_nli.py", line 176, in trainepoch
        output = nli_net((s1_batch, s1_len), (s2_batch, s2_len))
      File "/if5/wua4nw/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 206, in __call__
        result = self.forward(*input, **kwargs)
      File "/net/if5/wua4nw/wasi/academic/research_with_prof_chang/fb_research_repos/InferSent/models.py", line 731, in forward
        u = self.encoder(s1)
      File "/if5/wua4nw/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 206, in __call__
        result = self.forward(*input, **kwargs)
      File "/net/if5/wua4nw/wasi/academic/research_with_prof_chang/fb_research_repos/InferSent/models.py", line 44, in forward
        idx_sort = torch.cuda.LongTensor(idx_sort) if self.use_cuda else torch.LongTensor(idx_sort)
    RuntimeError: tried to construct a tensor from a int sequence, but found an item of type numpy.int64 at index (0)
    

    Any idea why I am getting this error? I am using Python 3.5; could that be the reason?

  • Model is no longer available in S3 Amazon bucket

    I had working code that simply loads the InferSent model. Now it won't unpickle the model:

    MODEL_PATH = "./encoder/infersent1.pkl"
    params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                    'pool_type': 'max', 'dpout_model': 0.0, 'version': model_version}
    inferSent = InferSent(params_model)
    print(MODEL_PATH)
    inferSent.load_state_dict(torch.load(MODEL_PATH))

    use_cuda = False
    inferSent = inferSent.cuda() if use_cuda else inferSent
    # If infersent1 -> use GloVe embeddings. If infersent2 -> use InferSent embeddings.
    W2V_PATH = './dataset/GloVe/glove.840B.300d.txt' if model_version == 1 else '../dataset/fastText/crawl-300d-2M.vec'
    inferSent.set_w2v_path(W2V_PATH)
    

    It results in the error UnpicklingError: invalid load key, '<'.
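The `invalid load key, '<'` message means the first byte of the file is `<`, which is typical of an HTML or XML error page that was saved in place of the model. A minimal sketch for checking a download before unpickling it; the helper name `looks_like_markup` and the throwaway demo files are hypothetical:

```python
import os
import tempfile

def looks_like_markup(path, probe_bytes=64):
    """Return True if the file starts with '<', i.e. looks like HTML/XML."""
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes)
    return head.lstrip().startswith(b"<")

# Demo with two throwaway files: a fake error page and a fake binary pickle.
with tempfile.TemporaryDirectory() as tmp:
    bad = os.path.join(tmp, "error_page.pkl")
    good = os.path.join(tmp, "binary.pkl")
    with open(bad, "wb") as f:
        f.write(b"<?xml version='1.0'?><Error/>")
    with open(good, "wb") as f:
        f.write(b"\x80\x02}q\x00.")
    bad_is_markup = looks_like_markup(bad)
    good_is_markup = looks_like_markup(good)

print(bad_is_markup, good_is_markup)  # True False
```

If the check returns True for a downloaded model file, the download URL most likely no longer serves the model and the file should be fetched again from a working link.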

  • get_data script malfunction

    $ ./get_data.bash http://nlp.stanford.edu/data/glove.840B.300d.zip
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
      0   315    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    100 2075M  100 2075M    0     0   383k      0  1:32:18  1:32:18 --:--:--  545k
    Archive: glove.840B.300d.zip
    warning [glove.840B.300d.zip]: 76 extra bytes at beginning or within zipfile
      (attempting to process anyway)
    error [glove.840B.300d.zip]: reported length of central directory is
      -76 bytes too long (Atari STZip zipfile? J.H.Holm ZIPSPLIT 1.1
      zipfile?). Compensating...
    skipping: glove.840B.300d.txt need PK compat. v4.5 (can do v2.1)

    note: didn't find end-of-central-dir signature at end of central dir. (please check that you have transferred or created the zipfile in the appropriate BINARY mode and that you have compiled UnZip properly)

  • Infersent1 and Infersent2 pretrained models are the same !!

    curl does not work with these links:

    curl -Lo encoder/infersent1.pickle https://dl.fbaipublicfiles.com/infersent/infersent1.pkl
    curl -Lo encoder/infersent2.pickle https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

    and after trying

    curl -Lo examples/infersent1.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent1.pkl
    curl -Lo examples/infersent2.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent2.pkl

    The models are downloaded, but they seem to contain the same thing.

    Related to: https://github.com/facebookresearch/InferSent/issues/93

  • Fixed reproducibility issue with newer PyTorch versions.

    For PyTorch 1.0.0, the aggregation of losses over the batch has to be specified when constructing torch.nn.CrossEntropyLoss. Changing the attribute manually afterwards has no effect. This causes the reproducibility issue in #96, which my commit fixes. It also fixes crashes that occur with PyTorch 1.0.0 due to improper access of scalar values.

  • Two pooling issues

    Hi, thanks for sharing this code! I noticed two issues with the current implementation of mean/max pooling over the BiLSTM.

    1. sent_len is not unsorted before being used for normalization. At Line 46 sent_len is sorted from biggest to smallest, and the input embeddings are adjusted accordingly. At Line 61 the hidden states are rearranged into the original order, while sent_len is not. This might lead to incorrect normalization of mean-pooling.

    2. Padding is not handled before pooling. As a result, the encoded sentence, and thus the prediction, depends on the amount of padding. I'm not sure if this is by design or a mistake. I ran into a case where running in a batch vs. running on each example gives me different predictions, as shown below. Note that this result might not be directly reproducible, as only the trained encoder is released and this example is generated from an SNLI classifier I trained on top of the released encoder.

    ss1 = [
        ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
        ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
        ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
        ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
        ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
        ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
        ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
        ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
        ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
        ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>']
    ]
    ss2 = [
        ['<s>', 'A', 'man', 'is', 'repainting', 'a', '</s>'],
        ['<s>', 'man', 'is', 'repainting', 'a', 'garage', '</s>'],
        ['<s>', 'A', 'man', 'is', 'a', 'garage', '</s>'],
        ['<s>', 'A', 'is', 'repainting', 'a', 'garage', '</s>'],
        ['<s>', 'A', 'man', 'repainting', 'a', 'garage', '</s>'],
        ['<s>', 'A', 'man', 'is', 'wearing', 'a', 'shirt', '</s>'],
        ['<s>', 'A', 'man', 'is', 'wearing', 'a', 'blue', '</s>'],
        ['<s>', 'A', 'is', 'wearing', 'a', 'blue', 'shirt', '</s>'],
        ['<s>', 'A', 'man', 'wearing', 'a', 'blue', 'shirt', '</s>'],
        ['<s>', 'man', 'is', 'wearing', 'a', 'blue', 'shirt', '</s>']
    ]
    
    k = 6
    ss1 = ss1[:k]
    ss2 = ss2[:k]
    model.eval()
    s1, s1_len = get_batch(ss1, word_vec)
    s2, s2_len = get_batch(ss2, word_vec)
    s1 = Variable(s1.cuda())
    s2 = Variable(s2.cuda())
    p = torch.max(model((s1, s1_len), (s2, s2_len)), 1)[1].data.cpu().numpy()
    for i in range(len(ss1)):
        b = (get_batch([ss1[i]], word_vec), get_batch([ss2[i]], word_vec))
        print(p[i], torch.max(forward(model, b), 1)[1].data.cpu().numpy()[0])
    
    output:
    1 0
    1 1
    1 1
    1 1
    1 1
    0 0
    

    The second issue might be related to Issue #48 .
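The padding dependence described in point 2 can be illustrated on a toy example: if hidden states at padded positions are zero and all real values in some dimension are negative, max-pooling over the full padded length returns 0 instead of the true maximum. The sketch below is written in NumPy and is not the repository's code; `masked_max_pool` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def masked_max_pool(hidden, lengths):
    """Max-pool over time, ignoring positions beyond each sentence length.

    hidden:  array of shape (seq_len, batch, dim), zero at padded positions
    lengths: array of shape (batch,) with the true sentence lengths
    """
    seq_len = hidden.shape[0]
    # mask[t, b] is True where time step t is a real token of sentence b.
    mask = np.arange(seq_len)[:, None] < np.asarray(lengths)[None, :]
    # Replace padded positions by -inf so they can never win the max.
    masked = np.where(mask[:, :, None], hidden, -np.inf)
    return masked.max(axis=0)

# Toy batch: sentence 0 has 3 tokens, sentence 1 has 1 token (2 padded steps).
hidden = np.array([
    [[-1.0, 2.0], [-3.0, -0.5]],
    [[-2.0, 1.0], [0.0, 0.0]],
    [[-0.5, 0.0], [0.0, 0.0]],
])
lengths = np.array([3, 1])

naive = hidden.max(axis=0)                # padding zeros leak into the result
fixed = masked_max_pool(hidden, lengths)  # depends only on real tokens
print(naive[1], fixed[1])
```

With the mask in place, the pooled vector of the short sentence no longer changes with the length of the longest sentence in its batch.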

    I made an attempt to fix these two issues in my pull request. With pooling issues fixed I trained a SNLI classifier from scratch. Performance increased a little on SNLI (dev 84.56, test 84.7), but decreased on almost all transfer tasks. Here are numbers I got (Fork column):

    | Task             | SkipThought | InferSent | Fork      |
    |------------------|-------------|-----------|-----------|
    | MR               | 79.4        | 81.1      | 79.86     |
    | CR               | 83.1        | 86.3      | 83.16     |
    | SUBJ             | 93.7        | 92.4      | 92.45     |
    | MPQA             | 89.3        | 90.2      | 90.01     |
    | STS14            | .44/.45     | .68/.65   | .65/.62   |
    | SICK Relatedness | 0.858       | 0.884     | 0.877     |
    | SICK Entailment  | 79.5        | 86.1      | 85.45     |
    | SST2             | 82.9        | 84.6      | 81.77     |
    | SST5             | -           | -         | 44.03     |
    | TREC             | 88.4        | 88.2      | 85.4      |
    | MRPC             | -           | 76.2/83.1 | 74.2/81.8 |

  • Download InferSent models via curl request failing

    I tried downloading both of the InferSent models via the provided links:

    curl -Lo encoder/infersent1.pkl https://s3.amazonaws.com/senteval/infersent/infersent1.pkl
    curl -Lo encoder/infersent2.pkl https://s3.amazonaws.com/senteval/infersent/infersent2.pkl

    Both of them throw the same error:

    Warning: Failed to create the file encode/infersent/allnli.pickle: No such file or directory

    Is there an updated path to download them?

  • InferSent encoder demo with GloVe - Key Error

    I'm trying to run the demo.ipynb notebook in the encoder module with 300-dimensional GloVe vectors. I've run all the commands as detailed in the README and the notebook, but at the model.encode command I get the following error:

    KeyError                                  Traceback (most recent call last)
    <ipython-input-36-3fb4b1a1a3f7> in <module>()
    ----> 1 embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)
          2 print('nb sentences encoded : {0}'.format(len(embeddings)))
    
    /ais/hal9000/vkpriya/InferSent-master/encoder/models.py in encode(self, sentences, bsize, tokenize, verbose)
        220         for stidx in range(0, len(sentences), bsize):
        221             batch = Variable(self.get_batch(
    --> 222                         sentences[stidx:stidx + bsize]), volatile=True)
        223             if self.is_cuda():
        224                 batch = batch.cuda()
    
    /ais/hal9000/vkpriya/InferSent-master/encoder/models.py in get_batch(self, batch)
        172         for i in range(len(batch)):
        173             for j in range(len(batch[i])):
    --> 174                 embed[j, i, :] = self.word_vec[batch[i][j]]
        175 
        176         return torch.FloatTensor(embed)
    
    KeyError: </s>
    
    

    Should I explicitly add the symbol to the word vector file? Thanks!

  • BLSTM Encoder retrieval issue

    Got this error while trying to use the pretrained model for generating sentence embeddings on a local dataset:

    /usr/local/lib/python2.7/dist-packages/torch/serialization.py:284: SourceChangeWarning: source code of class 'models.BLSTMEncoder' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
      warnings.warn(msg, SourceChangeWarning)
    26126
    Traceback (most recent call last):
      File "sentEmbed.py", line 18, in <module>
        embeddings = infersent.encode(sentences, bsize=128, tokenize=False, verbose=True)
      File "/home/ritvik/InferSent-master/encoder/models.py", line 198, in encode
        sentences, bsize, tokenize, verbose)
      File "/home/ritvik/InferSent-master/encoder/models.py",  line 175, in prepare_samples
        s_f = [word for word in sentences[i] if word in self.word_vec]
      File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 238, in __getattr__
        type(self).__name__, name))
    AttributeError: 'BLSTMEncoder' object has no attribute 'word_vec'
    
    

    Used the latest AllNLI pickle. A similar bug was present in the SentEval repository, but the bug fix there doesn't seem to apply here.

  • search of similar sentences

    I have a scenario where I need to solve the following problem. It is not really an issue with this repo, but more a request for algorithmic help in using InferSent in a particular case.

    I have a set of n sentences for which I have created encodings using InferSent.

    Given a query sentence, I want to find the sentences most similar to it among the n sentences.

    In what data structure should I arrange the n sentences so that I can quickly find the top k similar sentences?

    More generally, how do I perform search over sentence encodings?
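One simple baseline for this kind of search is to keep the n encodings as rows of a NumPy matrix and rank them by cosine similarity to the query. The sketch below is an illustration, not part of InferSent: `top_k_similar` is a hypothetical helper, and random vectors stand in for the output of `infersent.encode`.

```python
import numpy as np

def top_k_similar(query_vec, corpus_vecs, k=3):
    """Return indices and cosine scores of the k corpus rows closest to query_vec."""
    corpus_norm = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = corpus_norm @ query_norm       # cosine similarity per corpus row
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]           # highest scores first
    return top, scores[top]

# Stand-in for encoder output: 5 random "embeddings" of dimension 4096.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((5, 4096))
query = corpus[2] + 0.01 * rng.standard_normal(4096)  # near-duplicate of row 2

indices, scores = top_k_similar(query, corpus, k=2)
print(indices[0])  # 2
```

This brute-force scan costs one matrix-vector product per query, which is usually acceptable for moderate n. For very large collections, an approximate nearest-neighbour index is the usual next step.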

  • ModuleNotFoundError: No module named 'models'

    Getting this error on Google colab.

    Both torch & torchvision seem to be preinstalled.

    !pip install torchvision returns

    Requirement already satisfied: torchvision
    Requirement already satisfied: torch
    .........
    
    import torch
    from models import InferSent
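`from models import InferSent` imports the `models.py` file from the InferSent repository, not an installed package, so Python has to be able to find that file. A minimal sketch of one way to arrange this, assuming the repository has been cloned into a directory named `InferSent`; both the directory name and the helper `add_repo_to_path` are hypothetical:

```python
import os
import sys

def add_repo_to_path(repo_dir):
    """Put repo_dir at the front of sys.path so `import models` can find models.py."""
    repo_dir = os.path.abspath(repo_dir)
    if repo_dir not in sys.path:
        sys.path.insert(0, repo_dir)
    return repo_dir

# "InferSent" is an assumed clone location; adjust it to where the repo lives.
repo = add_repo_to_path("InferSent")
# from models import InferSent  # works once models.py exists in that directory
```

Changing the working directory to the cloned repository before importing is an alternative with the same effect.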
    
  • Vector Concatenation Question

    I am wondering if there is more explanation of the vector concatenation. In the paper, u, v, uv, and |u-v| are concatenated and then used for the softmax loss. |u-v| seems intuitive, since it goes to near zero when u and v are close (semantically similar). But the meaning of u, v and uv is not clear; would you elaborate? Also, I see some work took the square of max(u, v), and it seems to work well. Overall, I am curious why these work and how they were designed.
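For reference, the concatenation discussed here can be written out in a few lines. The sketch below reads "uv" as the element-wise product of the two sentence vectors, which is an assumption about the notation; `nli_features` is a hypothetical helper, not a function from this repository.

```python
import numpy as np

def nli_features(u, v):
    """Concatenate u, v, |u - v| and u * v along the last axis."""
    return np.concatenate([u, v, np.abs(u - v), u * v], axis=-1)

# Two toy 3-dimensional "sentence embeddings".
u = np.array([1.0, -2.0, 0.5])
v = np.array([1.0, 1.0, -0.5])

features = nli_features(u, v)
print(features.shape)  # (12,)
```

The resulting vector is four times the embedding dimension and is what a classifier on top of the encoder would consume.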

  • added weigh_words method to get importance of words

    Added a new weigh_words method. Using weigh_words we can get the words and their importance directly instead of only being able to visualize it. It uses the logic that was previously in visualize to get the important words and then reuses this method in visualize.

  • Does this work anymore?

    I have tried multiple installation options, and multiple run options, but keep getting runtime errors or module errors. Is this code just out-of-date and unusable now?

    If it is still working for you, what version of python and pytorch are you running it on?

  • Cuda by Default

    Hi there, this is just a part of my group project, and I am not fully familiar with the model. I am trying to run this particular part,

    # cuda by default
    nli_net.cuda()
    loss_fn.cuda()

    and getting this error message:

    Traceback (most recent call last):
      File "train_nli.py", line 129, in <module>
        nli_net.cuda()
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 260, in cuda
        return self._apply(lambda t: t.cuda(device))
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 187, in _apply
        module._apply(fn)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 187, in _apply
        module._apply(fn)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 117, in _apply
        self.flatten_parameters()
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 113, in flatten_parameters
        self.batch_first, bool(self.bidirectional))
    RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
