Word Embeddings

Giving a computer insight into text

How can a computer system that actually only "understands" numbers gain some insight into the meaning of words and even more, into the meaning of a word in the context of a sentence or text?

When we use computers to understand language, we need a way to turn words into something the computer can work with. A computer usually only understands binary numbers, and through conversion also decimal or hexadecimal numbers. But how can you convert words and, even more, the meaning of words in their context, into numbers? One way we can do this is by using "vectors" to represent each word.

What is a "vector"?

A vector is a “special” arrow that helps us show both how far we need to go and which way we need to go to get somewhere. It's like a treasure map that tells you how many steps to take in a certain direction to find the treasure.

For example, imagine you want to go to your friend's house, but you don't know where it is. You ask your mom for directions, and she says to go 5 blocks north and then 3 blocks east. If we draw an arrow starting at your current location and pointing in the direction of your friend's house, the length of the arrow represents the distance you need to travel, and the direction of the arrow shows you which way to go. This is a vector!
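
As a tiny illustration (a Python sketch using the numbers from the example above), the trip can be written as a vector, and the length of the arrow follows from the Pythagorean theorem:

```python
import math

# The trip from the example as a 2-D vector: (blocks east, blocks north)
trip = (3, 5)

# The length of the arrow is the straight-line distance to the house
length = math.sqrt(trip[0] ** 2 + trip[1] ** 2)
print(length)  # about 5.83 blocks "as the crow flies"
```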

We can use vectors to describe lots of things in the world, like how fast a car is moving or which way the wind is blowing. They help us understand how things are moving and changing in our world.

A vector with multiple dimensions

One-dimensional

A vector with one dimension is like a regular number that goes back and forth on a number line. For example, if we want to show how far away a tree is, we might use a one-dimensional vector that just tells us how many steps away it is on the number line.

But sometimes we need to know more than one thing about a situation. For example, imagine you are playing a game where you move a character around on a screen. To keep track of where the character is, we need to know both the character's position along the x-axis (left and right) and the y-axis (up and down). This means we need a two-dimensional vector that tells us both how far the character is to the left or right and how far up or down.

And sometimes, we need even more information! For example, if we're flying a plane, we might need to know not just how high the plane is, but also how fast it is moving and in which direction. One way to capture that motion is with a three-dimensional vector that tells us how fast the plane is moving in three different directions: up and down, left and right, and forward and backward.

So, vectors can have multiple dimensions because sometimes we need to keep track of more than one thing at once. The more dimensions a vector has, the more information it can give us about a situation!

Three-dimensional vector (x, y, z)
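
In code, a vector with more dimensions is simply a list with more numbers in it. The values below are made up just to show the shape of the data, not taken from the text:

```python
# One dimension: how many steps away the tree is
tree = [12]

# Two dimensions: the game character's position on the screen (x, y)
character = [4, 7]

# Three dimensions: how fast the plane moves left/right, up/down, forward/backward
plane = [250.0, -3.5, 10.0]
```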

Word embeddings

When we use computers to understand language, we need a way to turn words into something the computer can work with. One way we can do this is by using "vectors" to represent each word.

Imagine we have a big list of words, like "cat", "dog", "tree", "house", and so on. We can represent each of these words as a vector with lots of different numbers in it. Each number in the vector represents a different aspect of the word, like how common it is, or what other words it tends to appear with.

For example, the vector for "cat" might have a high number for "cute" and "furry", and a low number for "big" and "heavy". The vector for "dog" might have high numbers for "friendly" and "loyal", and a low number for "scary".

By using vectors like this, we can teach computers to understand the meaning of words based on the patterns of numbers in their vectors. We can also use these vectors to find words that are similar to each other, or to help computers translate between different languages.
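
A toy version of this idea in Python, with invented scores (a real word embedding learns its numbers automatically, and its dimensions don't have neat labels like these):

```python
# Made-up scores between 0 and 10; each position in the list means:
# [cute, furry, big, heavy, friendly, loyal, scary]
words = {
    "cat": [9, 9, 2, 2, 5, 4, 1],
    "dog": [7, 8, 6, 5, 9, 9, 1],
    "truck": [1, 0, 9, 10, 0, 0, 4],
}

# Words whose numbers follow a similar pattern ("cat" and "dog")
# end up closer to each other than words that don't ("cat" and "truck").
```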

An example: suppose we describe an apple by giving each of four characteristics a score:

  • Sweetness: 8 (the apple is quite sweet)
  • Acidity: 3 (the apple is not very sour)
  • Size: 5 (the apple is an average size)
  • Colour: 9 (the apple is very red)

The vector for this particular apple would then be: [8, 3, 5, 9].

The characteristics in a graph.

These numbers represent how the apple's characteristics are perceived and evaluated. Each number in the vector helps form a numerical image that describes the apple's qualities.

Source: https://towardsdatascience.com/visualizing-word-embedding-with-pca-and-t-sne-961a692509f5

So, word embeddings in NLP (natural language processing) use vectors to represent words, and these vectors can help computers understand and work with language!

How to create a word vector?

When we create a vector to represent a word, we choose which characteristics of the word we want to represent with numbers.

For example, let's say we want to create a vector for the word "apple". We might decide that we want to represent how sweet or sour the apple tastes, how big or small it is, and what color it is. So we might use three numbers in our vector to represent these characteristics: a high number for sweetness (because the apple is not very sour), a medium number for size, and a high number for redness, for example.

Different people might choose different characteristics to represent in their vectors, depending on what they want to use the vectors for. If we were making a vector to help a computer recognize different types of fruit, for example, we might choose different characteristics than if we were making a vector to help a computer write a poem about fruit.

So, when we create a vector for a word, we get to decide what aspects of the word we want to represent with numbers. We might choose different characteristics depending on what we want to use the vectors for!
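
Sketched in Python, with "lemon" and its scores invented here purely to show that the same chosen characteristics are reused for every word:

```python
# We first decide which characteristics to score...
characteristics = ["sweetness", "size", "redness"]

# ...and then fill in the numbers for each word by hand
scores = {
    "apple": {"sweetness": 8, "size": 5, "redness": 9},
    "lemon": {"sweetness": 2, "size": 3, "redness": 0},
}

# Build the vectors, always using the characteristics in the same order
vectors = {word: [scores[word][c] for c in characteristics] for word in scores}
print(vectors["apple"])  # [8, 5, 9]
print(vectors["lemon"])  # [2, 3, 0]
```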

A human job

The decision of what characteristics of a word to represent with a number in a vector is typically made by humans. Usually, experts in the field of natural language processing (NLP) or machine learning work on developing these word embeddings.

To create word embeddings, researchers typically use algorithms to analyze large amounts of text data and identify patterns in the way words are used together. Then, they use their knowledge of language and expertise in the field to select which characteristics of a word are most important to represent in the vector.

For example, researchers might analyze a large collection of news articles and identify that the word "president" is often used together with words like "government", "election", and "policy". Based on this analysis, they might decide that their vector for "president" should include numbers representing these concepts.

So, while algorithms are used to analyze data and identify patterns, the decision of which characteristics to represent in the vector is ultimately made by humans with expertise in the field.

Measuring the distance

When we represent words as vectors, we can measure the distance between them even if they have different characteristics.

To understand this, let's imagine that we have a big box of crayons. Each crayon is a different color, but we can still measure the distance between them in the box. We might arrange the crayons from lightest to darkest, or group them by primary colors.

Similarly, when we represent words as vectors, we can measure the distance between them based on their numerical values. For example, we might represent the word "banana" with a vector that has a high value for "yellow" and "curved", while the word "government" might have a high value for "politics" and "power". Even though these words have different characteristics, we can still measure the distance between their vectors.

One way to measure distance between vectors is to use something called the "cosine similarity". This is a way of comparing the angles between the vectors, rather than just the distance between their numerical values.

Using the cosine similarity, we can calculate how similar two vectors are, even if they have different characteristics. For example, we might find that the word "government" is more similar to the word "policy" than it is to the word "banana", even though these words have very different characteristics.

To understand it, imagine you have two toys that you want to compare. You can think of each toy as a list of numbers that describes its different features. For example, one toy might have a score of 1 for being soft, a score of 0 for being hard, and a score of 1 for being red. Another toy might have a score of 0 for being soft, a score of 1 for being hard, and a score of 0 for being red.

Now, you can use these lists of numbers to calculate the cosine distance between the two toys. Don't worry about what "cosine" means, we'll explain that in a second. The important thing to understand is that the cosine distance is a way to measure how different the two toys are from each other based on these scores.

Think of the cosine distance like a ruler that measures how far apart two things are. The farther apart the toys are, the higher the cosine distance will be. And the closer together they are, the lower the cosine distance will be.

So why is it called the cosine distance? Well, "cosine" is just a fancy word for a type of math function that's used to calculate the distance between the two toys. But you don't need to worry about that part too much. The important thing is that the cosine distance is a way to measure how different two things are from each other based on a list of scores or features.

So, by using vectors to represent words and measuring their distance with the cosine similarity, we can compare words with different characteristics and still understand how similar or different they are. Cosine similarity is similar to cosine distance, but instead of measuring how different two things are, it measures how similar they are.

Let's use the same toy example from before. Say you have two toys, and you want to compare them based on their features. You can think of each toy as a list of numbers that describes its features, like how soft it is, what color it is, etc.

To calculate the cosine similarity between the two toys, you use the same list of numbers to measure how similar they are. Specifically, you measure the angle between the two lists of numbers, and the closer the angle is to zero, the more similar the toys are.

Think of the cosine similarity like a score between 0 and 1: the closer the score is to 1, the more the two lists of numbers point in the same direction and the more alike the two toys are; the closer it is to 0, the less the toys have in common.

So why is it called cosine similarity? Well, "cosine" is just a fancy word for a type of math function that's used to calculate the similarity between the two toys. But you don't need to worry about that part too much. The important thing is that cosine similarity is a way to measure how similar two things are based on a list of features.
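
Here is a small Python sketch of that comparison. The first two toys use the scores from the example above; toy_3 is an extra toy invented here so we also see a high similarity:

```python
import math

def cosine_similarity(a, b):
    """1 means the two lists point in the same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(x * x for x in b))
    return dot / (length_a * length_b)

# The two toys from the example, scored as [soft, hard, red]
toy_1 = [1, 0, 1]
toy_2 = [0, 1, 0]
toy_3 = [1, 0, 0]  # an extra, invented toy: soft but not red

print(cosine_similarity(toy_1, toy_2))      # 0.0  -> very different toys
print(cosine_similarity(toy_1, toy_3))      # ~0.71 -> fairly similar toys
print(1 - cosine_similarity(toy_1, toy_2))  # cosine distance = 1.0
```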

Contextualized word embeddings

An AI language model like ChatGPT doesn't have a single fixed set of vectors for all the words. Instead, the model uses a technique called contextualized word embeddings, where the vectors for each word are generated on the fly based on the context in which the word appears.

When you ask ChatGPT a question or give it a prompt, it will analyze the text to understand the meaning and context of the words you used. Then, ChatGPT generates a vector representation for each word that takes into account the context in which it appears.

For example, if you ask the model the question "What is a dog?", it would generate a different vector for the word "dog" than it would if you asked it "What is a hot dog?". In the first case, the vector for "dog" would be more closely related to the concept of a four-legged animal, while in the second case, the vector for "dog" would be more closely related to the concept of a type of food.

This technique allows the model to generate more nuanced and accurate representations for each word, based on the context in which it appears. So while the model doesn't keep a single fixed vector for every word it knows, it can generate vectors on the fly that are tailored to the specific context in which the words are being used.
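
ChatGPT's internal vectors aren't publicly accessible, but the same idea can be sketched with an open contextual model such as BERT, assuming the transformers and torch packages are installed. The details below are illustrative, not a description of ChatGPT itself:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    """Return the contextual vector the model produces for `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    index = tokens.index(word)  # position of the word among the tokens
    return outputs.last_hidden_state[0, index]

v1 = vector_for("what is a dog?", "dog")
v2 = vector_for("what is a hot dog?", "dog")

# The two vectors for "dog" differ because the contexts differ
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0))
```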

Summary

Word embedding is a technique used in natural language processing (NLP) to represent words as numerical vectors. The goal of word embedding is to capture the semantic and syntactic relationships between words in a way that can be easily understood and processed by machine learning algorithms.

The Banana Example

One popular method of generating word embeddings is through the use of neural networks. The neural network is trained on a large corpus of text, and during the training process, it learns to assign a unique numerical vector to each word in the vocabulary. These vectors are known as "word embeddings."

Here is an example of a word embedding, in this case a 10-dimensional vector, for the word "banana":

[0.452, -0.739, 0.817, 0.223, 0.094, -0.632, 0.222, -0.072, 0.512, -0.210]

Each number in this vector represents a different aspect of the word "banana." For example, the first number (0.452) might represent the fruitiness of the word, while the second number (-0.739) might represent its shape. The vector as a whole represents the meaning of the word in a way that can be easily processed by a machine learning algorithm.

A "vector representation" simply refers to the fact that the word is represented as a numerical vector. In NLP, we often use vectors to represent words, sentences, and even entire documents because they are easy to manipulate mathematically and can be used to calculate similarities and distances between different pieces of text.

A numerical vector is a mathematical object that consists of a collection of numbers arranged in a specific order. In NLP, numerical vectors are often used to represent text data, such as words, sentences, or documents, in a way that can be easily processed by machine learning algorithms.

Each number in this vector represents a different aspect of the word "banana," and the vector as a whole represents the meaning of the word in a numerical form that can be used in mathematical operations.

In general, numerical vectors can have any number of dimensions, and the values in each dimension can be any real number. Vectors can be added, subtracted, multiplied, and divided, and they can also be used to calculate distances and similarities between different vectors. This makes vectors a powerful tool for representing and processing data in many different fields, including NLP, computer vision, and machine learning.

Measuring the distance to the banana

One way to measure which words (or word vectors) are "close" to each other is by using a distance metric, such as cosine similarity or Euclidean distance. These distance metrics can be used to calculate the similarity or distance between two vectors, which can give us an indication of how "close" or "similar" the two vectors are.

Cosine similarity is one of the most commonly used distance metrics in NLP. It measures the cosine of the angle between two vectors in a high-dimensional space. The closer the cosine similarity is to 1, the more similar the vectors are, while a cosine similarity of 0 indicates that the vectors are unrelated (they point in completely different directions).

Euclidean distance, on the other hand, measures the straight-line distance between two vectors in a high-dimensional space. The smaller the Euclidean distance, the closer the vectors are.

To calculate the cosine similarity between two word vectors, we take their dot product and divide it by the product of the vectors' lengths; for the Euclidean distance, we take the length of the difference between the two vectors. Here is an example of how we might calculate both for two word vectors:
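
A small sketch with NumPy, reusing the "banana" vector from earlier; the vector for "mango" is invented here just for the comparison:

```python
import numpy as np

banana = np.array([0.452, -0.739, 0.817, 0.223, 0.094,
                   -0.632, 0.222, -0.072, 0.512, -0.210])
# A made-up vector for "mango", chosen to be close to "banana" for this example
mango = np.array([0.401, -0.701, 0.790, 0.250, 0.110,
                  -0.600, 0.200, -0.050, 0.490, -0.230])

# Cosine similarity: dot product divided by the product of the vector lengths
cosine_similarity = banana @ mango / (np.linalg.norm(banana) * np.linalg.norm(mango))

# Euclidean distance: the length of the difference between the two vectors
euclidean_distance = np.linalg.norm(banana - mango)

print(cosine_similarity)   # close to 1 -> very similar
print(euclidean_distance)  # small -> close together
```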

By measuring the similarity or distance between different word vectors, we can gain insights into the semantic and syntactic relationships between different words, which can be useful for a wide range of NLP tasks, such as sentiment analysis, text classification, and machine translation.

Capturing semantic relationships with "banana"

How can the word "banana" be related to both "ape" and "ice cream"?

One common method of exploring the semantic relationships between words using word embeddings is to look at the nearest neighbors of a particular word in the embedding space. The nearest neighbors are the words that have the most similar word vectors to the target word.

So, let's say we have a word embedding trained on a large corpus of text, and we want to explore the semantic relationships of the word "banana" in this embedding space. We can look at the top nearest neighbors of "banana" based on cosine similarity, and see what other words are most similar to it.

Here are the top 5 nearest neighbors of "banana" in a pre-trained word embedding:

  • banana
  • fruit
  • mango
  • pineapple
  • papaya

As you can see, most of the nearest neighbors of "banana" are other types of fruit, which makes sense since they share similar attributes, such as being sweet, edible fruits.

Now, let's look at the nearest neighbors of "banana" that are related to "ape":

  • ape
  • gorilla
  • chimpanzee
  • orangutan
  • monkey

In this case, the nearest neighbors of "banana" are all different types of primates, which makes sense since primates like to eat bananas and are often associated with them.

Finally, let's look at the nearest neighbors of "banana" that are related to "ice cream":

  • ice-cream
  • gelato
  • sorbet
  • frozen-yogurt
  • strawberry

In this case, the nearest neighbors of "banana" are different types of frozen desserts (plus a flavor that often goes with them), which makes sense since bananas are often used as a flavoring in ice cream and other frozen treats.

So, as you can see, word embeddings can capture the semantic relationships between words in a way that reflects how humans understand and use language. By examining the nearest neighbors of different words in an embedding space, we can gain insights into how different words are related to each other and what attributes they share.
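
A sketch of how such a nearest-neighbor lookup might be done with the gensim library, assuming it is installed and can download one of its small pre-trained embeddings; the neighbors it actually returns depend on that model and will not match the illustrative lists above:

```python
import gensim.downloader as api

# Download and load a small pre-trained embedding on first use
model = api.load("glove-wiki-gigaword-50")

# Print the five words whose vectors are closest to "banana"
for word, similarity in model.most_similar("banana", topn=5):
    print(word, round(similarity, 3))
```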

Detecting the relationships

Word embeddings are typically created using unsupervised learning algorithms that are trained on large amounts of text data, such as news articles or web pages. During the training process, the algorithm learns to assign a numerical vector to each word in the corpus based on its co-occurrence patterns with other words in the text. This means that words that appear in similar contexts are assigned similar vectors, which reflects their semantic similarity.

To detect the semantic relationship between words in a word embedding, we can use distance metrics such as cosine similarity or Euclidean distance to measure the similarity or distance between the word vectors. In the case of "banana" and "strawberry", their vectors are likely to be similar because they are both types of fruit and often appear in similar contexts. Therefore, their cosine similarity or Euclidean distance will be relatively small compared to the distance between "chimpanzee" and "strawberry".

The reason why word embeddings can capture semantic relationships between words is that the vectors encode information about the meaning and usage of words based on their co-occurrence patterns in the training corpus. For example, if the word "banana" appears frequently in the same context as words like "fruit", "sweet", and "yellow", the word embedding will learn to assign a vector to "banana" that is similar to the vectors of other words that also frequently appear in those contexts.

By looking at the nearest neighbors of a word in the embedding space, we can see which other words are most similar to it based on their co-occurrence patterns in the training corpus. This can help us to identify semantic relationships between words that might not be immediately apparent from their surface form or dictionary definition.

To detect a similar context, word embeddings typically use a technique called "distributional semantics." This means that words that appear in similar contexts are assumed to have similar meanings. The context of a word refers to the words that appear around it in a given text or corpus.

For example, if the word "banana" frequently appears in the same contexts as words like "fruit," "yellow," and "peel," then the word embedding will learn to represent "banana" as being semantically similar to these other words.

Sliding windows

The context of a word can be defined in different ways, depending on the specific algorithm used to create the word embedding. One common approach is to use a "sliding window" technique, where a fixed window of a certain number of words is moved over the text corpus, and the words within that window are used to define the context of each word. 

For example, if we use a sliding window of size 5, then for each word in the text corpus we would consider the 2 words to its left and the 2 words to its right as its context (5 words in total, counting the word itself). This means that any two words that appear in similar contexts in the text corpus are likely to have similar vector representations in the embedding space.

There are also more advanced techniques for defining context, such as using neural networks or probabilistic models to learn the most informative context for each word.

Overall, the goal of distributional semantics is to capture the meaning of words based on their usage patterns in a given text corpus. By analyzing the contexts in which words appear, word embeddings can learn to represent words in a way that reflects their semantic relationships with other words in the corpus.

Let's say we have the following sentence:

"I peeled the ripe banana and ate it for breakfast."

If we use a sliding window of size 5, we would move the window over the sentence as follows (the square brackets mark the words that fall inside the window):

  • "[I peeled the ripe banana] and ate it for breakfast." (window centered on "the")
  • "I [peeled the ripe banana and] ate it for breakfast." (window centered on "ripe")
  • "I peeled [the ripe banana and ate] it for breakfast." (window centered on "banana")
  • "I peeled the [ripe banana and ate it] for breakfast." (window centered on "and")
  • "I peeled the ripe [banana and ate it for] breakfast." (window centered on "ate")
  • "I peeled the ripe banana [and ate it for breakfast]." (window centered on "it")

A sliding window is a technique used to determine the context of a word in a text corpus. A sliding window of size N is a window that moves across the text corpus with a fixed stride of 1 word at a time, and with a size of N words. For example, if the text corpus is "I peeled the ripe banana and ate it for breakfast", and we use a sliding window of size 5, then the window would first cover the first 5 words of the text corpus: "I peeled the ripe banana". We would then move the window one word to the right so that it covers the next 5 words: "peeled the ripe banana and". This process would continue until we have covered the entire text corpus.

For each center word in the sliding window, the surrounding words within the window are considered its context. These words are used to generate a vector representation for the center word in the word embedding.

So, for example, if we use a sliding window of size 5 on the text corpus "I peeled the ripe banana and ate it for breakfast" and center the window on the word "banana", then the context words would be "the", "ripe", "and", and "ate". These words would be used to generate the vector representation for the word "banana" in the word embedding.
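
A minimal Python sketch of this windowing step (it only pairs each center word with its context; an embedding algorithm would then learn vectors from these pairs):

```python
sentence = "I peeled the ripe banana and ate it for breakfast".split()

window_size = 5                  # the center word plus 2 words on each side
radius = (window_size - 1) // 2  # = 2

for i, center in enumerate(sentence):
    # Take up to `radius` words on each side of the center word
    context = sentence[max(0, i - radius):i] + sentence[i + 1:i + 1 + radius]
    print(center, "->", context)

# For "banana" this prints:  banana -> ['the', 'ripe', 'and', 'ate']
```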

Transformers and word embeddings / Word2vec

Word2Vec focuses on learning semantic relationships between words, while a transformer, such as BERT or GPT, learns contextual representations of words and their relationships. Transformers are more versatile and can be applied to a wide range of NLP tasks.
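
As a rough sketch of the Word2Vec side, here is how a tiny model might be trained with the gensim library (assuming gensim is installed; a real model needs far more text than these two toy sentences):

```python
from gensim.models import Word2Vec

sentences = [
    ["i", "peeled", "the", "ripe", "banana", "and", "ate", "it", "for", "breakfast"],
    ["the", "monkey", "ate", "a", "sweet", "yellow", "banana"],
]

# Train a small Word2Vec model with 10-dimensional vectors and a window of 5
model = Word2Vec(sentences, vector_size=10, window=5, min_count=1)

print(model.wv["banana"])               # the learned 10-dimensional vector
print(model.wv.most_similar("banana"))  # its nearest neighbors (toy data!)
```
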
NLP, or Natural Language Processing, is a technology that helps computers understand and work with human language. It's like teaching computers to talk like us! With NLP, computers can read, listen, and understand what we say or write. They can even answer our questions or have conversations with us. NLP is used in things like voice assistants (like Siri or Alexa), language translation apps, and even in some video games. It's all about making computers smarter when it comes to understanding and using language, just like we do!