Transformer
A Transformer is a powerful type of neural network that can understand and generate sequences of information, such as text, audio, or even images.
Imagine a Transformer as an extremely clever AI that can look at an entire sentence (or even a whole paragraph!) in one go and work out how each word is connected to the others. It’s like having an extremely focused listener who doesn’t merely pay attention to the latest word, but remembers and takes into account everything that came before and after it as well. This characteristic makes Transformers remarkably good at tasks such as translating languages, composing stories, and answering your questions helpfully.
Why Did We Need Something New? The Problem with Old Ways
Before Transformers, the primary way AI understood sequences such as sentences was through models known as Recurrent Neural Networks (RNNs) and their more capable relatives, Long Short-Term Memory networks (LSTMs). These models processed information step by step, like reading one word at a time.
Suppose you have to recall a very long shopping list. You may forget the items at the start by the time you get to the end. RNNs suffered from a similar issue, known as the “vanishing gradient problem”: as they worked through a long sequence, the influence of earlier words faded away, making it hard to connect the later parts of the sequence to information from far back. LSTMs were better at remembering, but they still had to process the sequence step by step, which could be slow, particularly for extremely long passages of text.
Another problem was that these older models weren’t always sure which words in a sentence were most significant to understanding its meaning. For instance, in the sentence “The dog chased the ball,” the words “dog” and “ball” are more significant to understanding the action than the word “the.” Older models didn’t have a good way to automatically determine this.
This is where the concept of “attention” entered, which is the magic ingredient that makes Transformers work so well.
The “Attention” Trick: Focusing on What Matters
The core concept in Transformers is something called the “attention mechanism.” Think about how you read a sentence. Your mind doesn’t treat all the words as equally important at every instant. Rather, it pays greater attention to the words that are most critical to understanding the particular word at hand.
For instance, when you’re reading “The big yellow bus drove down the street,” your eye might naturally glance back to “bus” when you reach the word “drove,” to work out what is doing the driving. The Transformer model works the same way. For each word it reads, it looks at all the other words in the sentence and decides how much “attention” to pay to each one. This helps it understand the connections between words, regardless of how far apart they are in the sentence. It is as if the model can see which words are related and affect one another.
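To make this concrete, here is a minimal sketch of the core computation, known as scaled dot-product attention, in Python with NumPy. The sizes and the random vectors standing in for word embeddings are made up purely for illustration; a real model learns these numbers during training.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each row of Q asks 'what am I looking for?', each row of K says
    'here is what I contain', and each row of V carries the information
    that actually gets passed along."""
    d_k = Q.shape[-1]
    # How well does each word's query match every other word's key?
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax: turn the scores into attention weights that sum to 1 per word.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each word's output is a weighted mix of all the value vectors.
    return weights @ V, weights

# Toy example: 4 "words", each represented by 8 numbers.
x = np.random.default_rng(0).normal(size=(4, 8))
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention: Q, K, V all come from x
print(attn.round(2))  # each row shows how much one word attends to the others
```

Each row of the printed matrix is one word’s “attention budget” spread over the whole sentence, which is exactly the glance-back-to-“bus” behavior described above.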
How a Transformer is Built: The Different Parts Working Together
A Transformer model has a particular structure composed of two primary components: the encoder and the decoder. Think of the encoder as the part that reads and comprehends the input, and the decoder as the part that uses that comprehension to produce the output.

The Encoder: Reading and Understanding
The encoder receives the input (such as a sentence) and converts it into a special kind of representation the model can work with. It consists of multiple layers stacked one on top of the other, and each layer contains a couple of key parts:
Multi-Head Self-Attention: This is where the “paying attention” magic actually occurs. Suppose you’re reading a sentence with a few highlighters of different colors, where each highlighter marks a different type of relationship between the words. For instance, one may mark words that are grammatically related, while another marks words with similar meanings. “Multi-head” attention is like having several such highlighters operating simultaneously, so the model can comprehend different kinds of relationships between the words in the input. “Self-attention” refers to the fact that the model is examining the connections between the words of the same input sentence. (A small code sketch of this appears after this list.)
Feed-Forward Network: After the self-attention component, the information for each word is passed through another processing unit known as a “feed-forward network.” This further refines the model’s understanding of each word in the context of the whole sentence.
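As a rough sketch of the “several highlighters” idea in code: the input is projected into a few smaller heads, each head runs the attention computation from the earlier sketch, and the results are concatenated and passed through the feed-forward network. All names and sizes below are illustrative, not any particular library’s API.

```python
import numpy as np

rng = np.random.default_rng(1)
n_words, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

x = rng.normal(size=(n_words, d_model))  # stand-in word embeddings

# Each "highlighter" (head) gets its own projections of the same input.
heads = []
for _ in range(n_heads):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(x @ Wq, x @ Wk, x @ Wv))

# Glue the heads back together and mix them with one final projection.
Wo = rng.normal(size=(d_model, d_model))
attended = np.concatenate(heads, axis=-1) @ Wo

# Feed-forward network: two linear layers with a ReLU in between,
# applied to each word's vector independently.
W1, W2 = rng.normal(size=(d_model, 16)), rng.normal(size=(16, d_model))
output = np.maximum(0, attended @ W1) @ W2
print(output.shape)  # (4, 8): still 4 words, 8 numbers per word
```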
Even before the sentence is sent into the encoder, every word is converted into a list of numbers known as an embedding. An embedding is like a secret code that captures the meaning of the word; words with similar meanings get similar codes. Because the Transformer sees all the words simultaneously, it also has to be told what order the words are in. This is achieved with something called positional encoding, which adds a bit of extra information to each word’s embedding to tell the model where the word sits in the sentence. It is like putting a timestamp on each word.
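One common recipe for positional encoding, used in the original Transformer paper, builds the “timestamp” out of sine and cosine waves of different frequencies, so every position gets a unique pattern. A small sketch of that idea (toy sizes again):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Build a unique pattern of numbers for each position in the sequence."""
    positions = np.arange(n_positions)[:, None]       # 0, 1, 2, ...
    dims = np.arange(0, d_model, 2)[None, :]          # even-numbered slots
    angles = positions / (10000 ** (dims / d_model))  # slower waves in higher slots
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even slots get sine waves
    pe[:, 1::2] = np.cos(angles)  # odd slots get cosine waves
    return pe

embeddings = np.random.default_rng(2).normal(size=(6, 8))  # 6 words, toy size
x = embeddings + positional_encoding(6, 8)  # add the "timestamp" to each word
```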
The Decoder: Generating the Output
The decoder takes the understanding from the encoder and uses it to produce the output (such as a translation or a response to a question). It, too, is made of a number of layers, like the encoder, but with a couple of differences:
Masked Multi-Head Self-Attention: This is like the self-attention in the encoder, but with a “mask.” Think of the decoder writing a sentence word by word. When it’s choosing the next word, it should only look at the words it has already written, not future ones. The “mask” stops the decoder from “cheating” and looking ahead. (A small sketch of this masking appears after this section.)
Multi-Head Attention on Encoder Output: This is where the decoder reconnects with the original input. For every word it’s attempting to produce, it considers the encoded form of the input sentence (from the encoder) and decides which aspects of the input are most relevant to the word it is currently producing. It’s as if the decoder keeps looking back at the original question or sentence to make sure its response or translation makes sense.
Feed-Forward Network: Similar to the encoder, the output of the attention layers is then processed by a feed-forward network.
Lastly, the output from the final decoder layer passes through one more step, a linear layer followed by a softmax, which turns it into a probability for every word in the vocabulary and picks which word should come next in the output sequence.
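The mask itself is easy to picture in code: before the attention weights are computed, the scores for future positions are set to negative infinity, so the softmax gives them zero weight. A minimal sketch, reusing the attention computation from earlier (everything here is illustrative):

```python
import numpy as np

def masked_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # The mask: word i may only look at words 0..i, never ahead.
    scores = np.where(np.tril(np.ones((n, n))) == 1, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

x = np.random.default_rng(3).normal(size=(4, 8))
_, w = masked_attention(x, x, x)
print(w.round(2))  # the upper-right triangle is all zeros: no peeking ahead
```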
How it Works: A Simple Example of Translation
Suppose we wish to translate the simple sentence “Hello” from English to French with a Transformer.
Input and Embedding: The input is the word “Hello”. It is converted into a numerical embedding, and positional encoding is added (although for a single word, this doesn’t matter much).
Encoder Processing: The embedded “Hello” passes through the encoder layers. Even with only one word, the self-attention mechanism prepares a representation of it, and the feed-forward network processes it further. The encoder now has a good understanding of the input.
Decoder Processing: The decoder begins attempting to produce the French word.
The masked self-attention has almost nothing to look at yet, since no French words have been produced (in practice, generation begins from a special start-of-sequence token).
The attention over the encoder output considers the encoded form of “Hello” and determines the most probable French translation: it has learned that “Hello” usually translates to “Bonjour.”
The feed-forward network processes the result further.
Output: The decoder produces the word “Bonjour.”
For a longer sentence, such as “The cat sat on the mat,” things are a bit more complicated, but the idea is the same. The encoder reads the entire English sentence and considers all of the relationships between its words. The decoder then uses that information to construct the French sentence “Le chat était assis sur le tapis” word by word, keeping the original English sentence in mind throughout.
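Putting the loop together: generation usually works by feeding the decoder’s own output back in, one word at a time, until a special end marker appears. The sketch below is a hypothetical outline of that loop; encode and decode_step are made-up stand-ins for a trained encoder and decoder, not real library functions.

```python
def translate(english_sentence, encode, decode_step, max_words=20):
    """Greedy decoding: repeatedly ask the decoder for its single most
    likely next word. `encode` and `decode_step` are hypothetical."""
    memory = encode(english_sentence)  # the encoder's understanding of the input
    french = ["<start>"]               # generation begins from a start token
    for _ in range(max_words):
        # The decoder sees only what it has written so far, plus the memory.
        next_word = decode_step(french, memory)
        if next_word == "<end>":       # the model signals it is finished
            break
        french.append(next_word)
    return " ".join(french[1:])
```

Real systems often improve on this greedy loop with tricks like beam search, which keeps several candidate sentences in play at once, but the basic shape is the same.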
Why are Transformers So Good? The Benefits
Transformers have become extremely popular because they provide some significant benefits over previous models:
They can see everything at once: This means they can draw the context of a word from all the other words in the sentence, even distant ones. This resolves the “long-range dependency” issue that plagued older models.
They can compute things in parallel: Unlike older models, which had to process one word at a time, one after the next, Transformers can perform many computations simultaneously. This makes them much faster to train on modern hardware, especially for long texts. (See the sketch after this list.)
They are extremely good at learning relationships: The attention mechanism enables them to learn directly how words in a sentence relate to one another, resulting in an improved understanding of the meaning.
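To see why “everything at once” matters, compare the two toy styles below: an RNN-style loop must finish one word before it can start the next, while a Transformer-style layer handles every word in a single matrix multiplication that hardware such as a GPU can run in parallel. (Both snippets are illustrations, not real models.)

```python
import numpy as np

rng = np.random.default_rng(4)
words = rng.normal(size=(1000, 64))  # 1000 words, 64 numbers each

# RNN style: a loop where step t cannot start until step t-1 is done.
W = rng.normal(size=(64, 64)) * 0.01
state = np.zeros(64)
for w in words:  # inherently sequential
    state = np.tanh(state @ W + w)

# Transformer style: one matrix product touches all 1000 words at once.
projected = words @ W  # no row depends on any other row
```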
Where are Transformers Used? Real-World Examples
Transformers are now used in many different AI applications that you might be using every day:
Language Translation: Software such as Google Translate employs Transformers to translate languages much more precisely and naturally.
AI Writing Assistants: If you’ve ever had an AI assist you with writing emails or articles, chances are it’s driven by a Transformer model. These models are capable of writing text that sounds surprisingly human.
Chatbots and Virtual Assistants: Most modern chatbots and virtual assistants employ Transformers to understand your questions and give useful answers.
Search Engines: Search engines such as Google utilize Transformers to comprehend the intent behind your search queries more effectively and deliver more appropriate results.
Reading Documents: Transformers may be applied to read and comprehend large documents, summarize them, or respond to questions about their content.
Even in Vision! Remarkably, the concepts behind Transformers are also being applied in computer vision, enabling AI to understand images.
What’s Next for Transformers? The Future is Bright
The AI world is ever-evolving, and scientists are always trying to improve Transformers even further. Some of the areas they are concentrating on are:
Making them more efficient and faster: Although Transformers are already very powerful, they can still be slow and use a lot of computing resources for extremely long texts, because attention compares every word with every other word. Researchers are looking for ways to make them faster and less resource-hungry.
Making them more transparent: It can be difficult to explain precisely why a Transformer makes a specific choice. Researchers are working to make these models more transparent so we can see how they operate.
Discovering new applications for them: Researchers are continually finding new and interesting applications for Transformers to solve different problems, ranging from robotics to medicine.
Conclusion: A Giant Leap Forward for AI
Transformers are a genuine breakthrough in Artificial Intelligence. They’ve made computers capable of reading and writing human language in ways that were previously unthinkable. Through the clever use of “attention,” Transformers can focus on the most important parts of a sequence and understand how those parts relate to each other, making them versatile tools for a wide range of applications. As AI continues to advance, Transformers will play an increasingly important role in how computers interact with and understand the world around us.
This should give you a good basic sense of what a Transformer is and why it’s causing all the fuss in the world of AI. If you want to keep learning, you can read the original research paper, “Attention Is All You Need,” or browse one of the many online tutorials and resources out there. You’ll find that this seemingly complex topic is built on some very intuitive ideas about how we humans understand language ourselves!