The Road to Transformers

Discover how Large Language Models (LLMs) work under the hood.

Apr 21, 2025

Introduction

The road that led to the invention of the Transformer architecture and, with it, to the birth of Large Language Models (LLMs) as we know them, started a long way back.
In this tutorial, we are going to walk that road once again, one step at a time, to get a more profound understanding of how these complex machines work and why they are so effective.
The idea came from the Natural Language Processing (NLP) research community, which was dealing, at the time, with one interesting task: Machine Translation (MT). In other words, they were investigating how to translate a sentence algorithmically from one language to another.
Up to that moment (it was 2014), Neural Networks (NNs) were used for some NLP tasks but were not yet providing state-of-the-art (SotA) results on MT. Things were about to change, though.
Kyunghyun Cho and his colleagues from the University of Montreal released a paper on arXiv [1] discussing how Recurrent Neural Networks (RNNs) could be used to address sequence-to-sequence learning problems, such as those in MT. In particular, they presented the first encoder-decoder architecture and explained how to train it to handle input sequences of variable length and generate output sequences of variable length.
After this first achievement, some Google researchers also published an important paper on the topic [2], and RNNs started to gain traction and to be used extensively in NLP, becoming the de facto standard for most tasks in the field.
Nevertheless, we now know that RNNs are a thing of the past, at least for complex NLP applications. But when did that shift start happening?
Shockingly enough, the seed of the destruction of the RNNs' reign was planted the very same year the reign was established, in 2014, and it was Cho himself who planted it. In fact, he was one of the inventors of the attention mechanism, presented for the first time in [3].
After that milestone, though, it took a few years for the community to understand how to leverage the power of the attention mechanism while getting rid of the recurrent architecture.
Some tried with Convolutional Neural Networks (see [4]), but then came Google's researchers, and in 2017 they published the first version of one of the most famous and impactful papers of recent years: Attention Is All You Need [5]. This was the very first appearance of the Transformer architecture. A few years later, at the end of 2022, OpenAI shocked everybody by releasing ChatGPT.
The rest is history.
Up to this point we have simply flown over the road to Transformers; it is now time to start walking it and dive deep into why they took over the world of NLP. We'll begin with an introduction to RNNs, their drawbacks, and how to train them to perform sequence-to-sequence transduction tasks. Thereafter, we'll explore how the attention mechanism and the Transformer architecture contributed to overcoming the previously highlighted limitations. In particular, the remainder of this blog post is organized as follows:
  • The Basics of Recurrent Neural Networks
  • The Drawbacks of Recurrent Neural Networks
  • Sequence to Sequence Transduction
  • The Attention Mechanism
  • The Transformer Architecture

The Basics of Recurrent Neural Networks

A generic RNN is composed of neurons organized in layers, just like a Feed Forward Neural Network (FFNN). The key difference lies in the inner workings of these neurons and the way they are represented, both mathematically and graphically.
In fact, because of the strong interdependence between recurrent neurons in a given layer, it is customary to describe them collectively as a single entity, referred to as a unit. Recurrent units are meant to process sequential data one element at a time while jointly updating an internal memory, referred to as the state, and optionally producing an output. The state of the unit is simply the collection of the states of its neurons, which are tightly coupled and influence each other; hence the need to represent them as a whole to properly capture their interaction.
Another quirk of RNNs has to do with multi-layered architectures, composed of a cascade of multiple recurrent units; such networks are commonly referred to as stacked networks.
It is worth pointing out that the word unit is sometimes used differently. For instance, in libraries such as TensorFlow or PyTorch, when instantiating a recurrent layer of some sort, the number of units indicates the number of neurons in that layer.
Let us make the discussion more formal by introducing some notation. Suppose we have a finite-length input sequence $\textbf{X} = \{ \textbf{x}_t \}_{t \in \mathcal{T}_X}$, where $\mathcal{T}_X = \{ 1, 2, \dots, T_X \}$ and $\textbf{x}_t \in \mathbb{R}^m \,\, \forall \,\, t \in \mathcal{T}_X$.
The state of a recurrent unit at a given time $t > 0$ is denoted by $\textbf{h}_t \in \mathbb{R}^l$, and its evolution is governed by a parametric mapping $f(\cdot, \cdot; \mathbf{\theta}): \mathbb{R}^l \times \mathbb{R}^m \rightarrow \mathbb{R}^l$ as follows:
$$\textbf{h}_t = f(\textbf{h}_{t-1}, \textbf{x}_t; \mathbf{\theta}),$$
where $\mathbf{\theta} \in \mathbb{R}^p$ is the parameters vector, and we assume the state to be initialized to $\textbf{h}_0 \in \mathbb{R}^l$.
The output is still a finite-length sequence $\textbf{Y} = \{\textbf{y}_t\}_{t \in \mathcal{T}_Y}$, with $\textbf{y}_t \in \mathbb{R}^n \,\, \forall \,\, t \in \mathcal{T}_Y$, indexed over a (potentially different) time set $\mathcal{T}_Y = \{ 1, 2, \dots, T_Y \}$. Its evolution is also governed by a parametric mapping $g(\cdot; \mathbf{\gamma}): \mathbb{R}^l \rightarrow \mathbb{R}^n$ such that:
$$\textbf{y}_t = g(\textbf{h}_t; \mathbf{\gamma}),$$
where $\mathbf{\gamma} \in \mathbb{R}^q$ represents the parameters vector.
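To make these two mappings concrete, here is a minimal NumPy sketch of a vanilla (Elman-style) recurrent unit. The text above keeps $f$ and $g$ generic, so the specific tanh/affine parameterization below is just one common choice, an assumption made for illustration.

```python
import numpy as np

def init_params(m, l, n, seed=0):
    """Randomly initialize theta (state update) and gamma (output map)."""
    rng = np.random.default_rng(seed)
    theta = {"W_h": rng.normal(0, 0.1, (l, l)),   # state-to-state weights
             "W_x": rng.normal(0, 0.1, (l, m)),   # input-to-state weights
             "b_h": np.zeros(l)}
    gamma = {"W_y": rng.normal(0, 0.1, (n, l)),   # state-to-output weights
             "b_y": np.zeros(n)}
    return theta, gamma

def f(h_prev, x_t, theta):
    """State update h_t = f(h_{t-1}, x_t; theta): an affine map followed by tanh."""
    return np.tanh(theta["W_h"] @ h_prev + theta["W_x"] @ x_t + theta["b_h"])

def g(h_t, gamma):
    """Output map y_t = g(h_t; gamma): a plain affine map."""
    return gamma["W_y"] @ h_t + gamma["b_y"]
```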
All in all, at any time step, one can interpret a generic recurrent unit as a collection of parametric mappings of the form:
$$r_h(\cdot; \mathbf{\theta}, \mathbf{\gamma}): \mathbb{R}^m \rightarrow \mathbb{R}^n,$$
each corresponding to the composition of the two parametric mappings defined above, for a fixed state $\textbf{h} \in \mathbb{R}^l$:
$$r_h(\cdot; \mathbf{\theta}, \mathbf{\gamma}) = g(f(\textbf{h}, \cdot; \mathbf{\theta}); \mathbf{\gamma}).$$
Alternatively, rather than focusing on a single time step, one can model the recurrent unit at a higher level of abstraction as an autoregressive parametric transducer that maps sequences of variable (but finite) length into sequences of variable (but finite) length:
$$r(\cdot, \textbf{h}_0; \mathbf{\theta}, \mathbf{\gamma}): (\mathbb{R}^m)^* \rightarrow (\mathbb{R}^n)^*,$$
where $(A)^*$ for a generic set $A$ corresponds to the collection of all finite-length sequences of elements in it, defined as $(A)^* = \bigcup\nolimits_{T=0}^{\infty}(A)^T$, with $(A)^T$ being the collection of all $T$-length sequences of elements in $A$.
Now that we've completed an initial overview of vanilla recurrent units, we can introduce a simple taxonomy that classifies RNNs into three main categories based on the lengths of their input and output sequences. That said, it's worth mentioning that more sophisticated types of recurrent units also exist, such as Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) units, which implement more advanced memorization mechanisms and require additional parameters and hyperparameters to be fully specified.

Many-to-One

These types of networks are meant to process an input sequence with more than one element (i.e., $T_X > 1$), and produce a length-one output sequence (i.e., $T_Y = 1$).
Considering a trivial single-unit RNN, the processing mechanism (represented below) can be summarized in two steps:
  1. The whole input sequence is fed into the unit one element at a time, causing multiple updates of its state.
  2. Once the last element of the input sequence has been fed into the network, the output is generated based on the unit's latest state.
Many-to-One Single-Unit RNN Architecture (Unfolded Representation)
The same idea can be generalized to stacked architectures. In this case, the network still processes the input sequence one element at a time, and a given unit processes the output of the previous one and propagates its output to the next unit. Once the entire input sequence has been processed and multiple state updates of all the units in the network have been produced, the output is generated based on the latest state of the last unit.
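Reusing the f, g, and init_params sketched earlier, the two-step many-to-one mechanism boils down to a single loop. This is a sketch under the same assumed parameterization, not a full implementation.

```python
import numpy as np

def many_to_one(X, h0, theta, gamma):
    """Process the whole input sequence, then emit one output from the final state."""
    h = h0
    for x_t in X:                 # step 1: one element at a time, updating the state
        h = f(h, x_t, theta)
    return g(h, gamma)            # step 2: single output from the latest state

# Example: a length-5 input sequence of 3-dimensional vectors.
theta, gamma = init_params(m=3, l=8, n=2)
X = np.random.default_rng(1).normal(size=(5, 3))
y = many_to_one(X, h0=np.zeros(8), theta=theta, gamma=gamma)   # shape (2,)
```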

One-to-Many

These types of networks are meant to process a length-one input sequence (i.e., $T_X = 1$) and produce an output sequence with more than one element (i.e., $T_Y > 1$).
Considering again a trivial RNN with a single recurrent unit, the processing mechanism (represented below) is based on feeding the unit's outputs back as inputs to produce multiple state updates and, consequently, multiple outputs from a length-one input sequence. All in all, what happens is the following:
  1. The only element of the input sequence is fed into the unit, causing a first state update and the generation of the first element of the output sequence.
  2. From that moment on, the latest generated element of the output sequence is fed back into the unit as input, producing an additional state update and the generation of an additional element of the output sequence.
  3. The process is repeated a certain number of times until the conditions of some stopping criterion are met.
One-to-Many Single-Unit RNN Architecture (Unfolded Representation)
As for many-to-one RNNs, and based on the same principles, the above idea can be generalized to stacked architectures. There is one catch, though: in many-to-one RNNs, each unit can begin processing the next input element as soon as it finishes the current one. In contrast, in one-to-many RNNs, the output must first propagate through the entire network and be fed back before computations can proceed.
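A matching sketch for the one-to-many case, again reusing the f and g defined earlier. Note that feeding outputs back as inputs implicitly requires the output and input dimensions to match (n = m); the stopping criterion is reduced here to a simple step budget, which is an assumption for brevity.

```python
def one_to_many(x1, h0, theta, gamma, max_steps):
    """Generate a sequence from a single input element by feeding outputs back as inputs."""
    h = f(h0, x1, theta)          # step 1: first state update from the only input element
    y = g(h, gamma)
    outputs = [y]
    for _ in range(max_steps - 1):
        h = f(h, y, theta)        # step 2: the latest output becomes the next input
        y = g(h, gamma)
        outputs.append(y)
    return outputs                # step 3: stopping criterion here is just a step budget
```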

Many-to-Many

These types of networks are meant to process an input sequence with more than one element and produce an output sequence with more than one element.
The two sequences don't need to be the same length, but the basic architecture may change depending on whether or not they are. In fact, if the two sequences have different lengths, the simplest many-to-many RNN one can think of may have (i) a many-to-one encoder and (ii) a one-to-many decoder, connected back-to-back, and both consisting of a single recurrent unit. If the two sequences have the same length, instead, there is no need for these two separate processing blocks.
Many-to-Many Encoder-Decoder RNN Architecture (Unfolded Representation)
Many-to-Many Single-Unit RNN Architecture (Unfolded Representation)

The Drawbacks of Recurrent Neural Networks

RNNs are suitable for handling sequential data by design and are capable of processing sequences of any length, which is extremely useful in many applications. Nevertheless, they present some intrinsic limitations:
  • Limited Memory. Relying on memory is nice when dealing with sequential data, but what happens when the sequence is too long? RNNs struggle to capture complex long-term relationships between elements of the input sequence, which degrades the quality of the results.
  • Sequential Processing. Processing sequential data is one thing; doing it sequentially is another. RNNs crunch one element of the input sequence at a time, which hurts performance in terms of processing time and prevents parallelization.
As often happens, limitations drive innovation, and this is precisely what happened for sequence processing, as we'll discuss in the following paragraphs.

Sequence to Sequence Transduction

As we mentioned before, the first attempts to use RNNs for MT tasks were made in 2014 by different groups of researchers. The Neural MT Systems they presented were based on many-to-many RNNs, with architectures consisting of an encoder and a decoder jointly trained on a specific language pair.
Generic Encoder-Decoder Architecture for MT
We have the basics of how such networks operate, but to understand how they are actually used for sequence-to-sequence transduction tasks like MT, we need to dive deeper. Based on the frameworks presented in [1] and [2], we’ll explore the topic in three steps: (i) we describe the basic architecture of a neural transducer (modeling), (ii) we explain how to learn its parameters (training), and (iii) we show how to use it after training (inference).
As we are going to see, while the modeling changes over time (e.g., the attention mechanism is invented, Transformers replace RNNs), both the training and inference strategies stay more or less the same.

Modeling

The neural transducers we are interested in are a generalization of those seen so far, and we need to establish some notation first:
  • Input Sequence. $\textbf{X} = \{\textbf{x}_t\}_{t \in \mathcal{T}_X}$, with $\mathcal{T}_X = \{ 1, 2, \dots, T_X \}$ and $\textbf{x}_t \in \mathbb{R}^m \,\, \forall \,\, t \in \mathcal{T}_X$.
  • Output Sequence. $\hat{\textbf{Y}} = \{\hat{\textbf{y}}_t\}_{t \in \mathcal{T}_Y}$, with $\mathcal{T}_Y = \{1, 2, \dots, T_Y \}$ and $\hat{\textbf{y}}_t \in \mathbb{R}^n \,\, \forall \,\, t \in \mathcal{T}_Y$.
  • Encoder State Evolution. $\textbf{H} = \{\textbf{h}_t\}_{t \in \mathcal{T}_X}$, with $\textbf{h}_t \in \mathbb{R}^l \,\, \forall \,\, t \in \mathcal{T}_X \cup \{0\}$.
  • Decoder State Evolution. $\textbf{S} = \{\textbf{s}_t\}_{t \in \mathcal{T}_Y}$, with $\textbf{s}_t \in \mathbb{R}^k \,\, \forall \,\, t \in \mathcal{T}_Y \cup \{0\}$.
We are now ready to discuss the encoder-decoder interplay:
  • Encoder. It processes the whole input sequence and produces an intermediate representation $\textbf{c} \in \mathbb{R}^d$, known as the context vector. The evolution of the encoder state over time, which occurs during the encoding process, is governed by a parametric mapping $f(\cdot, \cdot; \mathbf{\theta}): \mathbb{R}^l \times \mathbb{R}^m \rightarrow \mathbb{R}^l$, such that $\textbf{h}_t = f(\textbf{h}_{t-1}, \textbf{x}_t; \mathbf{\theta})$, and $\mathbf{\theta} \in \mathbb{R}^p$. Unlike generic many-to-one units, where the output depends solely on the final state, here the context vector is computed as a function of the entire sequence of encoder states. Formally, this can be written as $\textbf{c} = e(\textbf{H})$, where $e: (\mathbb{R}^l)^* \rightarrow \mathbb{R}^d$, and $(\mathbb{R}^l)^*$ denotes the set of all finite-length sequences of elements in $\mathbb{R}^l$. It is worth mentioning that the aforementioned mapping could be, and in fact usually is, parametrized, and its parameters can be learned at training time.
  • Decoder. It generates the output sequence one element at a time by recursively processing the context vector together with the previously generated output. Again, observe a difference compared to the generic one-to-many unit described earlier: the decoder state depends on an additional input, the context. Its evolution during the decoding process is thus governed by a parametric mapping $z(\cdot, \cdot, \cdot; \mathbf{\lambda}): \mathbb{R}^k \times \mathbb{R}^n \times \mathbb{R}^d \rightarrow \mathbb{R}^k$, such that $\textbf{s}_t = z(\textbf{s}_{t-1}, \hat{\textbf{y}}_{t-1}, \textbf{c}; \mathbf{\lambda})$, and $\mathbf{\lambda} \in \mathbb{R}^w$. The elements of the output sequence are then produced by another parametric mapping $g(\cdot; \mathbf{\gamma}): \mathbb{R}^k \rightarrow \mathbb{R}^n$ such that $\hat{\textbf{y}}_t = g(\textbf{s}_t; \mathbf{\gamma})$, and $\mathbf{\gamma} \in \mathbb{R}^q$ (a minimal code sketch of this interplay follows the figure below).
Attention-Less Encoder-Decoder Architecture for Sequence-to-Sequence Transduction (Unfolded Representation)
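To make the encoder-decoder interplay concrete, here is a minimal NumPy sketch of an attention-less transducer. The tanh/affine parameterizations and the choice $\textbf{c} = e(\textbf{H}) = \textbf{h}_{T_X}$ (the last encoder state) are assumptions made for brevity; real systems typically use GRU or LSTM units and richer, learnable context mappings.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, l, k, d = 4, 5, 8, 8, 8            # input, output, encoder-state, decoder-state, context dims

# Encoder f(h_{t-1}, x_t; theta) and decoder z(s_{t-1}, y_{t-1}, c; lambda): tanh-affine updates.
W_hh, W_hx = rng.normal(0, .1, (l, l)), rng.normal(0, .1, (l, m))
W_ss, W_sy, W_sc = rng.normal(0, .1, (k, k)), rng.normal(0, .1, (k, n)), rng.normal(0, .1, (k, d))
W_ys = rng.normal(0, .1, (n, k))         # output map g(s_t; gamma): logits over an n-token vocabulary

def encode(X, h0=np.zeros(l)):
    """Run the encoder over the whole input; the simplest context is e(H) = h_{T_X}."""
    h = h0
    for x_t in X:
        h = np.tanh(W_hh @ h + W_hx @ x_t)
    return h

def decode_step(s_prev, y_prev, c):
    """One decoding step: update the state, then produce a distribution over the vocabulary."""
    s = np.tanh(W_ss @ s_prev + W_sy @ y_prev + W_sc @ c)
    logits = W_ys @ s
    y_hat = np.exp(logits - logits.max())
    return s, y_hat / y_hat.sum()
```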

Training

At this point, we are ready to see how such a model is trained and used to perform MT tasks or, more generally, sequence-to-sequence transduction tasks.
We begin with the dataset, of course, which consists of ordered pairs of finite-length sequences, the first representing the source (transducer's input) and the second the target (expected transducer's output):
$$\mathcal{D} = \{(\textbf{X}^{(i)}, \textbf{Y}^{(i)}) \in (\mathbb{R}^m)^* \times (\mathbb{R}^n)^* \,\,:\,\, i=1, \dots, D\}.$$
Depending on the specific application, the samples have different meanings. In MT, for example, where the goal is to translate sentences from one language to another, every pair in the training dataset contains the numerical representations of a sentence in two different languages. Such numerical representations are obtained by applying the following pre-processing procedure to each sentence:
  1. Tokenization. The sentence is split into predefined strings of characters (e.g., syllables, words) called tokens.
  2. Encoding + Embedding. Each token is first associated with a number (or a sparse vector) and then transformed into a dense vector that captures semantics.
While we will not go into the technical details of tokenization, encoding, and embedding, we highlight two key points:
  • Although the set of possible tokens (i.e., the vocabulary) can be large, it is still finite.
  • The embedding phase can be omitted during pre-processing, allowing the network to learn the embeddings jointly with the main task by directly operating on the encoded tokens. In this case, a dedicated embedder with learnable parameters is placed at the input stage of both the encoder and the decoder.
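A toy end-to-end view of this pre-processing pipeline is sketched below. The whitespace tokenizer and the four-token vocabulary are purely illustrative assumptions; real systems use learned subword tokenizers and far larger vocabularies.

```python
import numpy as np

vocab = {"<eos>": 0, "the": 1, "cat": 2, "sat": 3}             # toy, finite vocabulary
E = np.random.default_rng(0).normal(0, 0.1, (len(vocab), 8))   # (learnable) embedding table

def preprocess(sentence):
    tokens = sentence.lower().split()    # 1. tokenization (toy: split on whitespace)
    ids = [vocab[t] for t in tokens]     # 2. encoding: token -> integer id
    return E[ids]                        # 3. embedding: id -> dense vector (table lookup)

X = preprocess("the cat sat")            # shape (3, 8): a length-3 input sequence
```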
We continue with the learning problem formulation, which requires (i) the definition of an objective (i.e., a loss function) and (ii) a detailed explanation of how it is connected to the task at hand. As is always the case in Supervised Machine Learning (ML), the loss function depends on both the parameters and the dataset, and it tries to capture how much, on average, the model's actual output differs from the expected one:
$$\mathcal{L}(\mathbf{\Theta}; \mathcal{D}) = \frac{1}{D}\sum_{i=1}^D \tilde{\mathcal{L}}_i(\mathbf{\Theta}) = \frac{1}{D}\sum_{i=1}^D \tilde{\mathcal{L}}(\textbf{Y}^{(i)}, \hat{\textbf{Y}}^{(i)}(\mathbf{\Theta})),$$
where $\tilde{\mathcal{L}}: (\mathbb{R}^n)^* \times (\mathbb{R}^n)^* \rightarrow \mathbb{R}_0^+$ is the sample loss. At training time, we treat this loss as a function of $\mathbf{\Theta} = (\mathbf{\theta}, \mathbf{\lambda}, \mathbf{\gamma})$, a triplet that collects all the trainable parameters seen so far.
That said, it is worth pointing out that there may be additional parameters beyond these. For instance, as previously mentioned, the function responsible for generating the context vector from the encoder states may also be parameterized, or we could be using more sophisticated recurrent units (e.g., GRUs or LSTMs).
To go one step further in specifying the loss, we must first clarify what the network is actually expected to output at every step, which depends on the sequence-to-sequence task at hand. More specifically, whether the task is modeled as a classification or a regression problem determines the cardinality of the set to which the elements of the output sequences belong, thus impacting the choice of the output activation and that of the loss function itself.
In the case of MT, as mentioned above, the output vocabulary size is finite regardless of the specific tokenization technique used. Therefore, it is common to model the generation of the output sentence as a sequence of classification tasks, where the network recursively predicts the next token at each time step.
In this setting, the RNN is typically designed (and trained) to produce sequences of probability distributions over the output vocabulary. The actual tokens are then sampled, either stochastically or deterministically, from these distributions, both during training (if the training strategy requires it) and at inference time.
To this end, the dimension of the final layer is set to be equal to the size of the output vocabulary $\mathcal{V}$, so that each component corresponds to one of the possible tokens. Moreover, the unnormalized scores this last layer produces, called logits and denoted as $\textbf{v} \in \mathbb{R}^n$, are normalized via a softmax activation function $\sigma: \mathbb{R}^n \rightarrow [0,1]^n$ to obtain a valid vectorized probability distribution $\hat{\textbf{y}} \in [0,1]^n$, with $\sum\nolimits_{j=1}^{n}\hat{\textbf{y}}[j] = 1$, and elements given by:
$$\hat{\textbf{y}}[j] = \sigma_j(\textbf{v}) = \frac{e^{\textbf{v}[j]}}{\sum_{k=1}^n e^{\textbf{v}[k]}}.$$
The predicted probability of a generic token $\textbf{y}^*$ with index $j^*$, which implicitly depends on (i) the previous elements in the output sequence, (ii) the whole input sequence, and (iii) the model parameters, can thus be written as $\mathbb{P}(\textbf{y}^*) = \hat{\textbf{y}}[j^*]$. Therefore, sampling from the output distribution is equivalent to selecting an index in the output vocabulary, which in turn corresponds to a specific token.
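A tiny numerical illustration of this point: softmax turns logits into a distribution over a (toy) four-token vocabulary, and sampling from it is just picking an index. The logits and vocabulary below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab = ["the", "cat", "sat", "<eos>"]
v = np.array([2.0, 0.5, -1.0, 0.0])          # logits: one unnormalized score per token
y_hat = np.exp(v) / np.exp(v).sum()          # softmax: a valid distribution (sums to 1)
j_star = rng.choice(len(y_hat), p=y_hat)     # sampling = selecting an index ...
token = toy_vocab[j_star]                    # ... which corresponds to a specific token
```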
The standard objective during training for a generic input-output pair is to maximize the probability of the ground-truth output sequence conditioned on the input sequence, $\mathbb{P}(\textbf{Y} | \textbf{X}; \mathbf{\Theta})$. Due to the autoregressive nature of the output generation process, this probability can be factorized as:
$$\mathbb{P}(\textbf{Y} | \textbf{X}; \mathbf{\Theta}) = \prod_{t=1}^{T_Y}\mathbb{P}(\textbf{y}_t | \textbf{Y}_{<t}, \textbf{X}; \mathbf{\Theta}),$$
where $\textbf{Y}_{<t}$ denotes the sequence of output tokens up to (but not including) time $t$, and the dependency on the model's parameters is explicitly expressed.
To write down the sample loss, we apply the negative logarithm to the expression above, obtaining:
$$\tilde{\mathcal{L}}(\mathbf{\Theta}) = -\sum_{t=1}^{T_Y}\log \left[ \mathbb{P}(\textbf{y}_t | \textbf{Y}_{<t}, \textbf{X}; \mathbf{\Theta}) \right].$$
We are interested in formulating our learning task as a minimization problem, and the negative logarithm allows us to do just that. In fact, since the logarithm is a strictly increasing function, it preserves the location of the optima while transforming the product into a sum, which is more tractable to optimize. All in all, we can thus write:
$$\mathcal{L}(\mathbf{\Theta}; \mathcal{D}) = \frac{1}{D}\sum_{i=1}^D\tilde{\mathcal{L}}_i(\mathbf{\Theta}) = -\frac{1}{D}\sum_{i=1}^D\sum_{t=1}^{T_Y}\log \left[ \mathbb{P}(\textbf{y}_t^{(i)} | \textbf{Y}_{<t}^{(i)}, \textbf{X}^{(i)}; \mathbf{\Theta}) \right].$$
In principle, at every time step, computing each of these probabilities is equivalent to selecting the component of the output that corresponds to the current ground-truth token. To do so while preserving differentiability, we represent the true output tokens as one-hot vectors, meaning that $\mathcal{V} \subseteq \{0, 1\}^n$ and $\sum\nolimits_{j=1}^n\textbf{y}[j] = 1 \,\,\, \forall \,\,\, \textbf{y} \in \mathcal{V}$, with the only non-zero entry located at the position matching the token index in the vocabulary. This way, the representation of each token can be interpreted as a valid probability distribution, and we can compare it with the one predicted by the network using the cross-entropy $\mathcal{H}: [0, 1]^n \times [0, 1]^n \rightarrow \mathbb{R}_0^+$ as follows:
$$\mathcal{H}(\textbf{y}, \hat{\textbf{y}}) = - \sum_{j=1}^{n} \textbf{y}[j] \log\left[\hat{\textbf{y}}[j]\right].$$
Because of the one-hot encoding, for any generic ground-truth token $\textbf{y}^*$ with index $j^*$, the expression simplifies to the negative log probability we need. Keeping the conditioning variables implicit to avoid notation clutter, we get:
$$\mathcal{H}(\textbf{y}^*, \hat{\textbf{y}}) = - \log\left[\hat{\textbf{y}}[j^*]\right] = - \log\left[\mathbb{P}(\textbf{y}^*)\right].$$
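Putting the last few formulas together, the sample loss reduces to summing, over the output time steps, the negative log probability assigned to each ground-truth token. A minimal NumPy check, with a made-up three-token vocabulary:

```python
import numpy as np

def sequence_nll(Y_true, Y_hat):
    """Sum over time of H(y_t, y_hat_t) = -log y_hat_t[j*], with one-hot ground-truth tokens."""
    loss = 0.0
    for y, y_hat in zip(Y_true, Y_hat):
        j_star = int(np.argmax(y))          # index of the ground-truth token
        loss -= np.log(y_hat[j_star])       # -log P(y_t | Y_<t, X; Theta)
    return loss

# Two time steps over a three-token vocabulary.
Y_true = [np.array([0., 1., 0.]), np.array([1., 0., 0.])]
Y_hat  = [np.array([.2, .7, .1]), np.array([.6, .3, .1])]
print(sequence_nll(Y_true, Y_hat))          # -(log 0.7 + log 0.6) ≈ 0.87
```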
We have discussed how to define an objective that effectively translates the learning goal into an optimization problem, but there are two missing pieces to understanding how to solve it through a computational procedure, and once again, they are consequences of the autoregressive nature of the selected approach.
1. The decoder needs a feedback signal to keep generating output tokens. At inference time we have no choice but to use its own outputs, but what do we do at training time?
The most common strategy is teacher forcing, where the ground-truth token from the previous time step is fed into the model at each step. This helps the model learn more efficiently by preventing error accumulation and avoids the need to sample from the generated output distributions. At the other end of the spectrum lies free-running training, where the model uses its own predictions during training as it would at inference time, a setup that better reflects actual usage but can suffer from instability early on. Intermediate strategies, such as scheduled sampling, gradually transition from teacher forcing to free-running by occasionally replacing the ground-truth token with the model's prediction, striking a balance between stability and realism.
2. In general, we are dealing with input and output sequences of arbitrary lengths. How do we know when to stop the encoding and the decoding processes?
The standard solution is to append an end-of-sequence token to each output sequence in the training dataset, thereby ensuring it is also included in the output vocabulary. During training, it can be used explicitly to signal the decoder when to stop. At inference time, generation halts whenever this token is sampled. A similar approach is applied to the input sequence, though it’s even simpler: since we have full control over the input, the end-of-sequence token can be consistently appended and used to terminate the encoding process at both training and inference time.
Training-Time Functioning of RNN-Based Encoder-Decoder Architecture with Teacher Forcing
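Combining the pieces above, a single teacher-forced training sample can be processed as sketched below, reusing the encode/decode_step functions from the earlier encoder-decoder sketch. The <bos> one-hot start vector y_bos used to kick off decoding is an assumption for illustration; the ground-truth sequence is expected to end with the <eos> token.

```python
import numpy as np

def teacher_forced_loss(X, Y_true, y_bos, s0):
    """One training sample with teacher forcing: at every step the ground-truth token
    (not the model's own sample) is fed back, and the per-step NLLs are accumulated."""
    c = encode(X)                                    # context vector from the encoder
    s, y_prev, loss = s0, y_bos, 0.0
    for y_t in Y_true:                               # Y_true ends with the <eos> one-hot vector
        s, y_hat = decode_step(s, y_prev, c)         # predicted distribution for this step
        loss -= np.log(y_hat[int(np.argmax(y_t))])   # -log P(ground-truth token)
        y_prev = y_t                                 # teacher forcing: feed the true token back
    return loss
```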

What we have seen so far represents the go-to approach for MT and other applications where the output generation happens through a sequence of classifications. In other applications, however, the output sequence may not consist of tokens drawn from a finite vocabulary but rather of continuous values. In such cases, the generation process is typically modeled as a succession of regression tasks, and a different output activation function (e.g., linear) as well as a different sample-level loss function (e.g., mean squared error) are used. This also implies that ad hoc decoding stopping criteria must be defined for inference, such as a maximum output length or application-specific heuristics, since no special end-of-sequence token is available.

Inference

Once the model is trained, it can be used as a stochastic autoregressive next-token generator that, given an input sequence, iteratively performs the following steps: (i) produces a probability distribution over the output vocabulary, (ii) samples it to generate an output token, and (iii) feeds the token back into the decoder to initiate the next cycle.
Inference-Time Functioning of RNN-Based Encoder-Decoder Architecture
To fully grasp the output generation process, we need to understand how sampling works and how decoding can be controlled and adapted to suit the specific application at hand.
The simplest approach is greedy decoding, which always picks the most probable token, essentially removing every source of randomness from the system (except for the initial states of the units, which are usually fixed anyway). A more sophisticated strategy that leverages randomness is top-k sampling, which selects the k most probable tokens, renormalizes their probabilities, and samples the next token from this reduced distribution, implicitly encouraging diversity in the output while discarding unlikely and thus potentially noisy tokens. Another stochastic strategy is top-p sampling, which differs from top-k sampling only in the criterion used to select the set of tokens to sample from. In particular, rather than always picking the same number of tokens, it considers the smallest set whose cumulative probability exceeds a given threshold p, thus allowing more flexibility.
When using top-k sampling, top-p sampling, or any other stochastic sampling method, it is common to apply a uniform pre-sampling scaling of the logits using a hyperparameter called temperature. Due to the high non-linearity of the exponential function, this results in a controlled warping of the output distribution.
Additionally, post-sampling techniques can also be applied, such as beam search, which expands multiple candidate sequences in parallel by generating, at each step, several possible continuations, and at the end of the decoding process selects the best based on some scoring criterion, such as the highest joint probability.
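The decoding strategies above are easy to express as functions that map a probability vector (or the logits, in the case of temperature scaling) to the index of the next token. A minimal NumPy sketch, leaving beam search aside:

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy(probs):
    return int(np.argmax(probs))                     # always the most probable token

def top_k(probs, k=5):
    idx = np.argsort(probs)[-k:]                     # the k most probable tokens
    p = probs[idx] / probs[idx].sum()                # renormalize over the reduced set
    return int(rng.choice(idx, p=p))

def top_p(probs, p=0.9):
    order = np.argsort(probs)[::-1]                  # tokens from most to least probable
    cumulative = np.cumsum(probs[order])
    cut = int(np.searchsorted(cumulative, p)) + 1    # smallest set with cumulative prob >= p
    idx = order[:cut]
    return int(rng.choice(idx, p=probs[idx] / probs[idx].sum()))

def temperature_scaled(logits, tau=0.8):
    """Pre-sampling scaling of the logits: tau < 1 sharpens the distribution, tau > 1 flattens it."""
    scaled = np.exp(logits / tau - np.max(logits / tau))
    return scaled / scaled.sum()
```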

The Attention Mechanism

It took some effort, but we are now ready to discuss the attention mechanism in both the original and the modern formulation, and given all the background we have, understanding its purpose and functioning will be easier than expected.

Original Formulation

The very same year RNNs began to be used for sequence-to-sequence transduction tasks, the attention mechanism was introduced to address the limited memory issue. The idea is simple: encoding the input sequence into a fixed context vector might be straightforward, but it doesn't scale well, especially when the input is long and rich in information.
Therefore, in [3], Cho and his colleagues proposed a workaround: instead of relying on a single fixed context, they recomputed a different one at each decoding step, tailoring it to the specific output token being generated. Alongside this, they also introduced a bidirectional encoder, which processes the input sequence in both directions, from the first token to the last (forward encoder) and vice versa (backward encoder), thus producing two sequences of hidden states.
Encoder-Decoder Architecture with Attention for Sequence-to-Sequence Transduction (Unfolded Representation)
The math changes a bit, especially on the encoder side, and we need to redefine a few things.
  • Forward Encoder State Evolution. $\textbf{H}' = \{\textbf{h}_t'\}_{t \in \mathcal{T}_X}$, with $\textbf{h}_t' \in \mathbb{R}^{l'} \,\, \forall \,\, t \in \mathcal{T}_X \cup \{0\}$.
  • Backward Encoder State Evolution. $\textbf{H}'' = \{\textbf{h}_t''\}_{t \in \mathcal{T}_X}$, with $\textbf{h}_t'' \in \mathbb{R}^{l''} \,\, \forall \,\, t \in \mathcal{T}_X \cup \{T_X + 1\}$.
  • Annotations Sequence. $\textbf{H} = \{\textbf{h}_t\}_{t \in \mathcal{T}_X}$, with $\textbf{h}_t = \left[\textbf{h}_t' ; \textbf{h}_t'' \right] \in \mathbb{R}^l \,\, \forall \,\, t \in \mathcal{T}_X$, and $l = l' + l''$.
  • Contexts Sequence. $\textbf{C} = \{\textbf{c}_t\}_{t \in \mathcal{T}_Y}$, with $\textbf{c}_t \in \mathbb{R}^{d} \,\, \forall \,\, t \in \mathcal{T}_Y$.
Each annotation in the annotations sequence, as it is called in the original paper, is obtained by concatenating the forward and backward encoder states corresponding to a given input element, and it carries information about both what comes before and what comes after it in the input sequence, with a particular emphasis on nearby elements in both directions.
These annotations play a pivotal role during decoding. On the one hand, the decoder operates as in the baseline approach discussed earlier: its state at each step depends on the previous state and the context at that step. On the other hand, the context is no longer fixed, and its evolution over time is governed by a parametric mapping $e(\cdot, \cdot; \mathbf{\alpha}): (\mathbb{R}^l)^* \times \mathbb{R}^k \rightarrow \mathbb{R}^d$, such that $\textbf{c}_t = e(\textbf{H}, \textbf{s}_{t-1}; \mathbf{\alpha})$, and $\mathbf{\alpha} \in \mathbb{R}^u$.
And here we are: this dynamic behavior of the context, adapting as the decoder generates the output sequence, is the attention mechanism!
Under the hood, the aforementioned mapping implements a weighted sum of the annotations, where the unnormalized weights, known as attention scores, are computed from the previous decoder state and the annotations themselves using another parametric function $a(\cdot, \cdot; \mathbf{\alpha}): \mathbb{R}^l \times \mathbb{R}^k \rightarrow \mathbb{R}$, referred to as the alignment model. They are then normalized with a softmax function to obtain the attention weights and used to produce the context vector. All in all, we can formalize the attention mechanism as follows:
$$\textbf{c}_t = e(\textbf{H}, \textbf{s}_{t-1}; \mathbf{\alpha}) = \sum_{i=1}^{T_X} \left(\frac{e^{a(\textbf{h}_i, \textbf{s}_{t-1}; \mathbf{\alpha})}}{\sum_{j=1}^{T_X} e^{a(\textbf{h}_j, \textbf{s}_{t-1}; \mathbf{\alpha})}}\right) \textbf{h}_i.$$
The attention weights implicitly indicate how well the current output matches each element of the input sequence, essentially telling the decoder where to focus or, in other words, which parts of the input deserve the most attention.
Thanks to the softmax scaling, these weights form a discrete probability distribution over the input tokens, acting as a kind of envelope that softly selects them, where higher peaks reflect stronger relevance to the current output being generated.
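In code, this mechanism reduces to a handful of lines. The additive form of the alignment model used below (a small one-hidden-layer network, with W_a, U_a, and v_a jointly playing the role of alpha) is the one proposed in [3]; the sketch also assumes the context dimension equals the annotation dimension.

```python
import numpy as np

def alignment(h_i, s_prev, W_a, U_a, v_a):
    """Alignment model a(h_i, s_{t-1}; alpha): additive ("Bahdanau") scoring.
    Shapes: W_a (p, l), U_a (p, k), v_a (p,)."""
    return v_a @ np.tanh(W_a @ h_i + U_a @ s_prev)

def attention_context(H, s_prev, W_a, U_a, v_a):
    """Context c_t = e(H, s_{t-1}; alpha): softmax-weighted sum of the annotations H."""
    scores = np.array([alignment(h_i, s_prev, W_a, U_a, v_a) for h_i in H])
    weights = np.exp(scores - scores.max())          # softmax over the T_X annotations
    weights /= weights.sum()
    return weights @ np.stack(H), weights            # context vector and attention weights
```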
As for the training process, it remains identical to that described in the previous paragraph, and the network is still trained to output probability distributions over the vocabulary. The only difference is that the alignment model must now also be learned, with its parameters jointly optimized with those of the rest of the network.

Modern Formulation

The attention mechanism in its original form is slightly different from the one presented in [5], where the authors introduced the now-famous trio: query, key, and value.
In this alternative formulation, attention is explicitly defined as a parametric mapping that takes one vector, the query, and a finite T-length sequence of ordered key-value vector pairs as inputs and produces a weighted sum of the values as output. The unnormalized weight assigned to each value is determined by a compatibility function and scores how well the query matches the corresponding key.
To formalize this modified version of the attention mechanism, let us establish some more notation:
  • Query Vector. $\textbf{q} \in \mathbb{R}^{d_q}$.
  • Key-Value Pairs. $\{ (\textbf{k}_t, \textbf{v}_t) \}_{t \in \mathcal{T}}$, with $\textbf{k}_t \in \mathbb{R}^{d_k}$ and $\textbf{v}_t \in \mathbb{R}^{d_v}$ for all $t \in \mathcal{T} = \{1, \dots, T\}$.
  • Compatibility Function. $a(\cdot, \cdot; \mathbf{\alpha}): \mathbb{R}^{d_q} \times \mathbb{R}^{d_k} \rightarrow \mathbb{R}$, scoring how well the query matches each key.
We can now translate into math what we explained in words:
$$\mathrm{attn}\left(\textbf{q}, \{(\textbf{k}_t, \textbf{v}_t)\}_{t \in \mathcal{T}}\right) = \sum_{t=1}^{T} \left(\frac{e^{a(\textbf{q}, \textbf{k}_t; \mathbf{\alpha})}}{\sum_{j=1}^{T} e^{a(\textbf{q}, \textbf{k}_j; \mathbf{\alpha})}}\right) \textbf{v}_t \in \mathbb{R}^{d_v}.$$
In [5], the compatibility function is the (parameter-free) scaled dot product $a(\textbf{q}, \textbf{k}) = \textbf{q}^\top \textbf{k} / \sqrt{d_k}$, which requires $d_q = d_k$.
Drawing a parallel between the two formulations, we can say that the compatibility function and the alignment model are the same thing under different names. Moreover, at each time step, the query corresponds to the previous decoder state in the original formulation, while keys and values coincide and correspond to the elements of the annotations sequence.
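A compact NumPy sketch of this formulation for a single query, using the scaled dot-product compatibility function of [5] (which requires d_q = d_k):

```python
import numpy as np

def attention(q, K, V):
    """Weighted sum of the values; weights = softmax of the query-key compatibilities."""
    d_k = K.shape[1]
    scores = K @ q / np.sqrt(d_k)                    # scaled dot-product compatibility
    weights = np.exp(scores - scores.max())          # softmax over the T key-value pairs
    weights /= weights.sum()
    return weights @ V                               # output lives in R^{d_v}

# One query against T = 3 key-value pairs, with d_q = d_k = 4 and d_v = 6.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(3, 4)), rng.normal(size=(3, 6))
out = attention(q, K, V)                             # shape (6,)
```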

The Transformer Architecture

What a ride, and the best is yet to come! It's time to dive deep into the neural architecture that's turning the tech world (and beyond) on its head: the Transformer!
If the attention mechanism was introduced to tackle the limited memory issue in RNNs, the Transformer was designed to overcome their sequential processing bottleneck. In fact, transformers are stateless, fully parallelizable at training time when using teacher forcing, and capture the sequential nature of the input through positional encodings. Moreover, the attention mechanism is not only generalized, as we anticipated, but is also used in multiple forms, namely, self-attention and cross-attention, throughout both the encoding and decoding processes.
Training-Time Functioning of Transformer-Based Encoder-Decoder Architecture with Teacher Forcing
Inference-Time Functioning of Transformer-Based Encoder-Decoder Architecture
At a high level of abstraction, the Transformer-based transducer looks very similar to the RNN-based one. Nevertheless, when we zoom inside the two main blocks, encoder and decoder, we see that there are no recurrent units anymore, but plenty of other things.
  • Positional Encoders. To handle the sequential nature of the input without recurrence, a specific vector is added to each embedded token depending on its absolute position in the input sequence.
  • Multi-Head Self-Attention (Encoder). It is composed of several generalized attention layers in parallel, whose outputs are concatenated column-wise and multiplied by a learnable projection matrix on the right. Each such layer takes the input sequence in matrix form (with elements on the rows) and performs the following operations:
    • Multiply it (on the right) by three different learnable matrices to linearly project each element into three different fixed-dimensional spaces, thus obtaining queries, keys, and values.
    • Apply a scaled version of the attention that operates on matrices row-wise, called the scaled dot-product attention, to queries, keys, and values.
  • Masked Multi-Head Self-Attention (Decoder). It works like its encoding counterpart, except that it implements a masking mechanism that avoids anti-causal leakage. In fact, during training it is provided with the full output sequence, and during inference with whatever has been generated up to that point, but each element should attend (i.e., focus) only on the previous ones; the causal mask sketched after this list implements exactly this constraint.
  • Multi-Head Cross-Attention (Decoder). It works as its encoding counterpart, but is not a self-attention layer anymore, and mixes up input and output sequences like the original attention mechanism. In particular, it uses the encoder output as both the keys and the values, while the queries come from the masked multi-head attention.
  • Element-Wise Feed Forward. This block of layers represents a fully connected Feed Forward Neural Network (FFNN), and it is applied separately and identically to each element of the input (i.e., each row). This means that, at the output of the decoder, all the output token distributions can be generated at once, which is precisely what happens at training time. At inference time, instead, all but the last are disregarded.
  • Residual Connection & Layer Normalization. There is a residual connection (i.e., an additive skip connection, see [6]) around every attention and feed-forward block, followed by a layer normalization step (see [7]). This is done to mitigate the vanishing and exploding gradient problems and to stabilize convergence during training.
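As a concrete illustration of two of the pieces above, here is a small NumPy sketch of the sinusoidal positional encodings used in [5] and of the causal mask mentioned in the masked self-attention bullet:

```python
import numpy as np

def positional_encoding(T, d_model):
    """Sinusoidal positional encodings from [5]: one d_model-sized vector per absolute position,
    added element-wise to the corresponding embedded token."""
    pos = np.arange(T)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def causal_mask(T):
    """Decoder self-attention mask: position t may attend only to positions <= t (True = allowed)."""
    return np.tril(np.ones((T, T), dtype=bool))
```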
Notice that we have skipped the embedders, not because they are not relevant, but because they have always been there, even if we did not represent them before, and they are not specific to the Transformer architecture.
On top of all these new pieces, the autoregressive behavior at inference time also works a bit differently. In fact, rather than taking only the last generated token, the decoder, being stateless, needs all the previous ones as well. Nevertheless, many of the computations can be cached to accelerate the process.
So, it looks like RNNs are finally out of the equation, or are they? Take a look at [8] to see what the authors think about it.

References

  1. Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014) [Link].
  2. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems 27 (2014) [Link].
  3. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014) [Link].
  4. Gehring, Jonas, et al. "Convolutional sequence to sequence learning." International conference on machine learning. PMLR, 2017 [Link].
  5. Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017) [Link].
  6. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016 [Link].
  7. Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer normalization." arXiv preprint arXiv:1607.06450 (2016) [Link].
  8. Feng, Leo, et al. "Were RNNs all we needed?" arXiv preprint arXiv:2410.01201 (2024) [Link].