How do you find the similarity of a sentence in Python?

Most of the libraries below should be a good choice for semantic similarity comparison. You can skip direct word comparison by generating word or sentence vectors with pretrained models from these libraries.
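
All of these approaches reduce the problem to comparing vectors, most often with cosine similarity. As a minimal sketch of that metric (plain numpy, with made-up vectors for illustration only):

Code:

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 means same direction, 0.0 means orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical 4-dimensional embeddings, not from any real model
a = np.array([0.1, 0.3, -0.2, 0.5])
b = np.array([0.2, 0.1, -0.1, 0.4])
print(cosine_similarity(a, b))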

Sentence similarity with Spacy

The required models must be downloaded first.

To use en_core_web_md, download it with python -m spacy download en_core_web_md. To use en_core_web_lg, run python -m spacy download en_core_web_lg.

The large model is around 830 MB as of this writing and quite slow, so the medium one can be a good choice.
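
If you would rather handle the download from Python instead of the shell, a small sketch (spacy.cli.download is the programmatic equivalent of the CLI command above):

Code:

import spacy

MODEL = "en_core_web_md"  # or "en_core_web_lg"
try:
    nlp = spacy.load(MODEL)
except OSError:
    # model not installed yet, so fetch it first
    from spacy.cli import download
    download(MODEL)
    nlp = spacy.load(MODEL)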

https://spacy.io/usage/vectors-similarity/

Code:

import spacy

# Load a pretrained pipeline with word vectors (the medium model is a lighter alternative).
nlp = spacy.load("en_core_web_lg")
# nlp = spacy.load("en_core_web_md")

doc1 = nlp('the person wear red T-shirt')
doc2 = nlp('this person is walking')
doc3 = nlp('the boy wear red T-shirt')

# Doc.similarity returns the cosine similarity of the averaged word vectors.
print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
print(doc2.similarity(doc3))

Output:

0.7003971105290047
0.9671912343259517
0.6121211244876517
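
Doc.similarity here is, as the spaCy docs describe it, the cosine similarity of the averaged token vectors, so individual words can be compared the same way. A small sketch:

Code:

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("person boy walking")
# Token.similarity works like Doc.similarity, but at the word level
print(doc[0].similarity(doc[1]))  # person vs. boy
print(doc[0].similarity(doc[2]))  # person vs. walking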

Sentence similarity with Sentence Transformers

https://github.com/UKPLab/sentence-transformers

https://www.sbert.net/docs/usage/semantic_textual_similarity.html

Install with pip install -U sentence-transformers. This one generates sentence embeddings directly.

Code:

from sentence_transformers import SentenceTransformer

# Load a pretrained sentence-embedding model.
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

sentences = [
    'the person wear red T-shirt',
    'this person is walking',
    'the boy wear red T-shirt'
    ]
# encode() returns one embedding vector per sentence (as a numpy array)
sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Output:

Sentence: the person wear red T-shirt
Embedding: [ 1.31643847e-01 -4.20616418e-01 ... 8.13076794e-01 -4.64620918e-01]

Sentence: this person is walking
Embedding: [-3.52878094e-01 -5.04286848e-02 ... -2.36091137e-01 -6.77282438e-02]

Sentence: the boy wear red T-shirt
Embedding: [-2.36365378e-01 -8.49713564e-01 ... 1.06414437e+00 -2.70157874e-01]

The embedding vectors can now be used to calculate various similarity metrics.

Code:

from sentence_transformers import SentenceTransformer, util
print(util.pytorch_cos_sim(sentence_embeddings[0], sentence_embeddings[1]))
print(util.pytorch_cos_sim(sentence_embeddings[0], sentence_embeddings[2]))
print(util.pytorch_cos_sim(sentence_embeddings[1], sentence_embeddings[2]))

Output:

tensor([[0.4644]])
tensor([[0.9070]])
tensor([[0.3276]])

The same values can be computed with scipy and pytorch. Note that scipy's distance.cosine returns the cosine distance, so the similarity is 1 minus that value.

Code:

from scipy.spatial import distance
print(1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[1]))
print(1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[2]))
print(1 - distance.cosine(sentence_embeddings[1], sentence_embeddings[2]))

Output:

0.4643629193305969
0.9069876074790955
0.3275738060474396

Code:

import torch

# CosineSimilarity with dim=0 compares two 1-D embedding vectors.
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
# convert the numpy embeddings to a torch tensor
b = torch.from_numpy(sentence_embeddings)
print(cos(b[0], b[1]))
print(cos(b[0], b[2]))
print(cos(b[1], b[2]))

Output:

tensor(0.4644)
tensor(0.9070)
tensor(0.3276)
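
Instead of calling pytorch_cos_sim pair by pair, it also accepts whole matrices, so every pairwise similarity can be computed in one call (reusing sentence_embeddings from above):

Code:

from sentence_transformers import util

# passing the full embedding matrix twice yields a 3x3 tensor of pairwise cosine similarities
sim_matrix = util.pytorch_cos_sim(sentence_embeddings, sentence_embeddings)
print(sim_matrix)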

Sentence similarity with TFHub Universal Sentence Encoder

https://tfhub.dev/google/universal-sentence-encoder/4

https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb

The model for this one is very large (around 1 GB) and seems slower than the others. It also generates embeddings for sentences.

Code:

import tensorflow_hub as hub

# Downloads the model (~1 GB) on first use and caches it locally.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed([
    "the person wear red T-shirt",
    "this person is walking",
    "the boy wear red T-shirt"
    ])

print(embeddings)

Output:

tf.Tensor(
[[ 0.063188    0.07063895 -0.05998802 ... -0.01409875  0.01863449
   0.01505797]
 [-0.06786212  0.01993554  0.03236153 ...  0.05772103  0.01787272
   0.01740014]
 [ 0.05379306  0.07613157 -0.05256693 ... -0.01256405  0.0213196
  -0.00262441]], shape=(3, 512), dtype=float32)

Code:

from scipy.spatial import distance
print(1 - distance.cosine(embeddings[0], embeddings[1]))
print(1 - distance.cosine(embeddings[0], embeddings[2]))
print(1 - distance.cosine(embeddings[1], embeddings[2]))

Output:

0.15320375561714172
0.8592830896377563
0.09080004692077637
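
The full pairwise matrix can also be computed with plain numpy; normalizing the rows first makes the dot products cosine similarities whether or not the embeddings are already unit-length (a sketch, reusing embeddings from above):

Code:

import numpy as np

emb = embeddings.numpy()
# normalize each row to unit length, then row dot products equal cosine similarity
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
print(emb @ emb.T)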

Other Sentence Embedding Libraries

https://github.com/facebookresearch/InferSent

https://github.com/Tiiiger/bert_score
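
Of these, bert_score computes a similarity score directly rather than an embedding. A hedged usage sketch (install with pip install bert-score; the exact API may vary between versions):

Code:

from bert_score import score

cands = ["the person wear red T-shirt"]
refs = ["the boy wear red T-shirt"]
# downloads a pretrained model on first use; F1 is the usual headline number
P, R, F1 = score(cands, refs, lang="en")
print(F1)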


Resources

How to compute the similarity between two text documents?

https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity

https://towardsdatascience.com/word-distance-between-word-embeddings-cc3e9cf1d632

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html

https://www.tensorflow.org/api_docs/python/tf/keras/losses/CosineSimilarity

https://nlp.town/blog/sentence-similarity/
