Transformers and long-term memory

Word2Vec - Word Embeddings

Word2vec is a type of natural language processing (NLP) algorithm that is used to represent words as numerical vectors, or "word embeddings." It was first introduced in a research paper by Tomas Mikolov et al. in 2013.

Word2vec works by analyzing large amounts of text data, such as a corpus of books, and using statistical models to learn the relationships between words. The algorithm then creates vector representations of each word based on its context within the text.
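As a rough illustration (not taken from the original paper), here is a minimal sketch of training word embeddings with the gensim library; the toy corpus and hyperparameters are made up and far too small to be useful in practice.

```python
# Minimal Word2Vec sketch using the gensim library.
# The toy corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

# A tiny "corpus": in practice Word2Vec is trained on millions of sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sleeps", "on", "the", "mat"],
    ["the", "dog", "sleeps", "on", "the", "rug"],
]

# Train embeddings: each word becomes a 50-dimensional vector,
# learned from the words that appear around it (window=2).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["king"][:5])            # first 5 numbers of the "king" vector
print(model.wv.most_similar("king"))   # words whose vectors are most similar
```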

These vector representations can be used for a variety of NLP tasks, such as text classification, sentiment analysis, and machine translation. For example, in text classification, the word embeddings can be used to train a machine learning model that can accurately classify documents based on their content.
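A hedged sketch of that idea: represent each document by the average of its word vectors and feed those averages to an off-the-shelf classifier. It reuses the `model` from the previous snippet; the documents and labels are invented purely for illustration.

```python
# Sketch: document classification on top of word embeddings.
# Assumes the `model` from the previous snippet; data and labels are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_vector(tokens, wv):
    """Represent a document as the average of its word vectors."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0)

docs = [["the", "king", "rules"], ["the", "queen", "rules"],
        ["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
labels = [1, 1, 0, 0]  # 1 = "royalty", 0 = "pets" (toy labels)

X = np.array([doc_vector(d, model.wv) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([doc_vector(["the", "king", "sleeps"], model.wv)]))
```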

Word2vec has been used in a variety of AI applications, including chatbots, recommendation systems, and search engines. It has also been used to improve the accuracy of machine translation systems and to develop more sophisticated language models.

Transformer

Instead of fixed, pre-trained embeddings like Word2Vec, transformers usually generate their own embeddings as part of the model training process. These embeddings are dynamic and contextual, meaning that the representation of a word can change depending on the other words in the sentence. This contrasts with Word2Vec, where each word has a fixed embedding, regardless of the context in which it is used.
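To make the difference concrete, the sketch below uses the Hugging Face transformers library with the publicly available bert-base-uncased model (chosen here only as an example) to show that the vector for the word "bank" changes with the sentence it appears in.

```python
# Sketch: contextual embeddings with a pre-trained BERT model
# (Hugging Face `transformers`); the sentences are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the contextual vector of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = word_vector("she sat on the bank of the river", "bank")
v2 = word_vector("he deposited money at the bank", "bank")

# Unlike Word2Vec, the two "bank" vectors are not identical:
print(torch.cosine_similarity(v1, v2, dim=0))
```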

You write a story, not from memory, but the way a fiction author writes a story: sentence by sentence. As you continue writing, you remember fairly well what you have already written, so when someone reads your story later, it makes sense. You don't (at your best, anyway) jump from one topic to another. The longer the story or your book becomes, the harder it is to avoid repeating yourself. Haven't I told you this before? But generally, we do remember what came before. A transformer model, like us, possesses long-term memory. When it produces a sentence or a short text, the system remembers all the preceding words and phrases. The model predicts the next word and adds it to that memory. As the model generates the text word by word, it can also focus on all the previous words. In this way, the model writes (or translates) a coherent text. And because it has been trained on a mountain of data (big data), it has an enormous frame of reference.
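A minimal sketch of that word-by-word loop, using GPT-2 as a stand-in because GPT-3 itself is not openly downloadable: the model re-reads everything written so far, predicts the most likely next token, and appends it.

```python
# Sketch of word-by-word generation: the model re-reads everything it has
# written so far and predicts the next token. GPT-2 is used as a stand-in.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids

for _ in range(20):                                   # generate 20 tokens
    with torch.no_grad():
        logits = model(input_ids).logits              # scores for every position
    next_id = logits[0, -1].argmax()                  # most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)  # "remember" it

print(tokenizer.decode(input_ids[0]))
```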

A transformer is called a transformer because it "transforms" one thing into another. In this case, it transforms words and sentences into numbers that a computer can understand.

Attention is all you need

So, when you talk to a transformer, it first "tokenizes" your words, which means it breaks them down into small pieces called tokens. Then, it uses a process called "attention" to figure out which tokens are related to each other and how they fit together.
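For example, a pre-trained tokenizer from the Hugging Face transformers library (the GPT-2 tokenizer is used here purely as an illustration) splits a sentence into tokens and into the numbers that stand for them:

```python
# Sketch: how a transformer tokenizer breaks text into tokens and IDs.
# Uses the GPT-2 tokenizer from Hugging Face `transformers`; the sentence is arbitrary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Transformers turn words into numbers."
print(tokenizer.tokenize(text))   # sub-word pieces; rare words get split up
print(tokenizer.encode(text))     # the integer IDs the model actually sees
```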

The attention process works a bit like how you pay attention in a conversation. If someone is talking to you, you focus on what they're saying and try to understand how their words relate to each other. Similarly, the transformer focuses on the tokens you've given it and tries to understand how they relate to each other to make sense of what you're saying.
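Under the hood this boils down to the scaled dot-product attention formula from the original paper. The minimal NumPy sketch below uses made-up numbers just to show the mechanics:

```python
# Minimal NumPy sketch of scaled dot-product attention:
# attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # how related is each token pair?
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: attention weights
    return weights @ V                                # weighted mix of the values

# Three tokens, each represented by a 4-dimensional vector (toy numbers).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
print(attention(X, X, X))   # self-attention: every token attends to every token
```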

Once the transformer has figured out how the tokens relate to each other, it can turn them into numbers that a computer can understand. These numbers can be used to do all sorts of things, like translating languages, answering questions, or summarizing text.

Overall, a transformer is a really clever way of helping computers understand human language by transforming it into something that they can work with.

Long-term memory

The importance of a 'long-term memory' was described in detail in the paper 'Attention Is All You Need' (Vaswani et al., 2017), in which the authors introduced a new neural network architecture: the transformer model. An 'encoder' receives the entire input text, while a 'decoder' generates the output (the next word) step by step, each time being fed the output it has produced so far. The attention mechanism and this long-term memory allow 'transformers' to make better predictions. Earlier types of neural networks attempted something similar, but possessed only a short-term memory. Thanks to the transformer architecture, natural language processing achieves unprecedented results.
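A bare-bones sketch of that encoder-decoder setup, using PyTorch's built-in nn.Transformer module with random tensors standing in for embedded source and target tokens:

```python
# Bare-bones encoder-decoder sketch with PyTorch's built-in Transformer.
# The tensors are random stand-ins for embedded source and target tokens.
import torch
import torch.nn as nn

transformer = nn.Transformer(d_model=32, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2)

src = torch.rand(10, 1, 32)   # encoder input: 10 source tokens, batch of 1
tgt = torch.rand(7, 1, 32)    # decoder input: the 7 tokens produced so far

# The causal mask keeps the decoder from peeking at future positions,
# so each step can only attend to what has already been generated.
tgt_mask = transformer.generate_square_subsequent_mask(7)

out = transformer(src, tgt, tgt_mask=tgt_mask)
print(out.shape)              # (7, 1, 32): one vector per target position
```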

Read Attention Is All You Need - https://arxiv.org/pdf/1706.03762.pdf

'Deep learning', already discussed earlier, revolutionized the landscape of NLP. OpenAI, co-founded by Elon Musk, launched the GPT-3 (Generative Pre-trained Transformer) model in 2020. With its 175 billion parameters, it was at the time the largest trained language model in the world. Other language models, such as Microsoft's Turing-NLG, are roughly ten times smaller. The GPT-3 model is capable of writing coherent texts based on a question or a short text. Please note: this is not to say that the model produces true, scientific, or politically correct information.
