Satyajeet Jadhav

9 months ago

Embeddings eli5 version

Computers don’t understand text like humans do. Computers understand numbers. To get computers to understand text, you need to convert the text to numbers.

But simply mapping each word to an arbitrary number won’t make a lot of sense. Words get their meaning from the words around them, so the numerical representation should capture those relationships.

One way to achieve this is by plotting these numbers on a graph.  

As a simplistic example, consider a line where higher numbers represent more sweetness. The word apple can be represented by the number 1, and the word juice by the number 2. The distance between apple and juice is 2 - 1 = 1. Since that distance is small, we could say apple and juice are similar in sweetness.
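The one-dimensional example above is just subtraction. A tiny sketch, with made-up sweetness scores:

```python
# Toy 1-D "embedding": one made-up attribute (sweetness) per word
sweetness = {"apple": 1, "juice": 2}

# On a line, similarity is just the gap between the two numbers
gap = abs(sweetness["juice"] - sweetness["apple"])
print(gap)  # 1 -> a small gap, so similar sweetness
```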

But words have more complex attributes than just sweetness. There is shape, size, color, or any other arbitrary attribute. Each of these could become a line, or dimension, of its own. What if we used two dimensions instead of one? For example, the x dimension is for sweetness and the y dimension is for sourness. We could then plot apple at (1,1) and juice at (2,2). Orange could be (1,2), and orange juice could be (2,1). We could keep adding more such dimensions, and a language model could come up with its own characteristics for each dimension. Though one can’t really visualize hundreds of dimensions, they are perfectly workable for a computer. Thus, words and sentences become points on a graph of n dimensions.
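Once words are points, "similar in meaning" becomes "close together on the graph". A minimal sketch using the toy 2-D coordinates from the example above (real embeddings have hundreds of dimensions, but the distance formula is the same):

```python
import math

# Each word is a point in a toy 2-D attribute space:
# x = sweetness, y = sourness (made-up values, not from a real model)
points = {
    "apple":        (1, 1),
    "juice":        (2, 2),
    "orange":       (1, 2),
    "orange juice": (2, 1),
}

def distance(a, b):
    """Euclidean distance between two points: smaller = more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance(points["apple"], points["orange"]))  # 1.0
print(distance(points["apple"], points["juice"]))   # ~1.414
```

The same function works unchanged for points with any number of dimensions, which is why it scales to real embedding vectors.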

There is an excellent explainer by Dharmesh Shah on what embeddings are. I don’t think anyone can explain it better. Please go and read that if you want to understand embeddings better - https://simple.ai/p/guide-vector-embeddings

A lot of models are available today that let you convert words to embeddings. There are a few categories of models here.

Companies like OpenAI and Cohere have powerful, large models that run on their servers, and they expose their embeddings APIs for a small fee. You send your text to the API and get back embeddings in the response. Recent advancements in AI technology have made these models really cheap.
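The request/response shape is simple. A sketch of the request you would send to such an API, without actually sending it; the endpoint, model name, and field names follow OpenAI's embeddings API, but treat them as illustrative and check your provider's docs:

```python
import json

# What you send: the model to use and the text to embed
payload = {
    "model": "text-embedding-3-small",  # provider-specific model name
    "input": "apple juice",
}

# The HTTP request, assembled but not sent (YOUR_API_KEY is a placeholder)
request = {
    "url": "https://api.openai.com/v1/embeddings",
    "headers": {"Authorization": "Bearer YOUR_API_KEY"},
    "body": json.dumps(payload),
}

# What you get back: a JSON body containing a list of floats -- the vector
print(request["body"])
```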

The second category is open models like Llama and Mistral. These too are large and powerful, but they are open: anyone can download them and run them on their own servers. Since they are large models, they typically can’t run from within every user’s browser.

The third category is small models, like those from Nomic or the ones Xenova’s Transformers.js can run. They are small, they are open, and they can run in your users’ browsers! But the obvious tradeoff is lower accuracy.
