import numpy as np

def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    # cosine similarity = dot product divided by the product of the norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
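A quick sanity check on toy vectors (illustrative values, not part of the exercise): parallel vectors should score 1.0 and orthogonal vectors 0.0.

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])
c = np.array([0.0, 3.0])
print(cosine_similarity(a, b))  # 1.0 -- same direction
print(cosine_similarity(a, c))  # 0.0 -- orthogonal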
Exercise: Embedding similarity
Task: Use the OpenAI embeddings API to compute the similarity between two given words or phrases.
Instructions:
- Choose two words or phrases with similar or related meanings.
- Use the OpenAI embeddings API to obtain embeddings for both words or phrases.
- Calculate the cosine similarity between the embeddings to measure their similarity.
- Print the similarity score and interpret the results.
Show solution
import os
from llm_utils.client import get_openai_client, OpenAIModels

MODEL = OpenAIModels.EMBED.value

# get the OpenAI client
client = get_openai_client(
    model=MODEL,
    config_path=os.environ.get("CONFIG_PATH")
)

# create the embeddings
word_1 = "king"
word_2 = "queen"

response_1 = client.embeddings.create(input=word_1, model=MODEL)
embedding_1 = response_1.data[0].embedding
response_2 = client.embeddings.create(input=word_2, model=MODEL)
embedding_2 = response_2.data[0].embedding

# calculate the distance
dist_12 = cosine_similarity(embedding_1, embedding_2)
print(f"Cosine similarity between {word_1} and {word_2}: {round(dist_12, 3)}.")
Cosine similarity between king and queen: 0.915.
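If you are not working with the course's llm_utils wrapper, the same steps can be sketched with the official openai package; the client setup and the model name "text-embedding-3-small" below are assumptions, not part of the original solution.

# minimal sketch with the official openai package (assumes the
# OPENAI_API_KEY environment variable is set and that the
# "text-embedding-3-small" model is available to your account)
from openai import OpenAI

oai_client = OpenAI()

def embed(text: str) -> list[float]:
    # request a single embedding and return the raw vector
    response = oai_client.embeddings.create(input=text, model="text-embedding-3-small")
    return response.data[0].embedding

print(round(cosine_similarity(embed("king"), embed("queen")), 3))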
= "pawn"
word_3 = client.embeddings.create(input=word_3, model=MODEL).data[0].embedding
embedding_3
= cosine_similarity(embedding_1, embedding_3)
dist_13 print(f"Cosine similarity between {word_1} and {word_3}: {round(dist_13, 3)}.")
Cosine similarity between king and pawn: 0.829.
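OpenAI embeddings are returned with (approximately) unit norm, so the cosine similarity reduces to a plain dot product; the check below is a small aside rather than part of the original solution.

# the vectors should already have unit length, so the dot product
# alone reproduces the cosine similarity from above
print(round(float(np.linalg.norm(embedding_1)), 3))       # ~1.0
print(round(float(np.dot(embedding_1, embedding_3)), 3))  # ~0.829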
Task: Use the OpenAI embeddings API and simple embedding arithmetic to introduce more context to word similarities.
Instructions:
- Create embeddings for the following three words using the OpenAI API: python, snake, javascript.
- Calculate the cosine similarity between each pair.
- Create another embedding for the word reptile and add it to python. You can use numpy for this.
- Calculate the cosine similarity between snake and this sum. What do you notice?
Show solution
= ["python", "snake", "javascript", "reptile"]
words = client.embeddings.create(input=words, model=MODEL)
response = [emb.embedding for emb in response.data] embeddings
print(f"Similarity between '{words[0]}' and '{words[1]}': {round(cosine_similarity(embeddings[0], embeddings[1]), 3)}.")
print(f"Similarity between '{words[0]}' and '{words[2]}': {round(cosine_similarity(embeddings[0], embeddings[2]), 3)}.")
print(f"Similarity between '{words[0]} + {words[3]}' and '{words[1]}': {round(cosine_similarity(np.array(embeddings[0]) + np.array(embeddings[3]), embeddings[1]), 3)}.")
Similarity between 'python' and 'snake': 0.841.
Similarity between 'python' and 'javascript': 0.85.
Similarity between 'python + reptile' and 'snake': 0.894.
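As a hypothetical follow-up (our extension, not part of the exercise), the same arithmetic should work in the other direction: adding a context word like "programming" ought to pull python toward javascript.

# hypothetical extension: steer "python" toward its programming sense
emb_programming = client.embeddings.create(input="programming", model=MODEL).data[0].embedding
steered = np.array(embeddings[0]) + np.array(emb_programming)
print(f"Similarity between 'python + programming' and 'javascript': {round(cosine_similarity(steered, embeddings[2]), 3)}.")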