# CS-E4890 - Deep Learning

## Content

### Lecture 1: Linear Classifiers and Multilayer Perceptrons

• Using the ReLU function makes the MLP a piecewise linear function.
• Chain rule via Jacobian matrices
• Each computational unit comprises of the trainable parameters, the inputs and the outputs. Implementation requires the forward computation of the outputs given the inputs and parameters, and the backward computations take the outputs and return the gradients of the outputs with respect to the parameters and the inputs.
• (Cybenko 1989) - Universal approximation theorem for single hidden layer MLPs.
• (Pascanu, Mikolov, and Bengio 2013) - MLP with a single hidden layer with $$n$$ inputs containing $$Lm$$ units has $$\mathcal{O}(L^nm^n)$$ regions for which the network is a piecewise linear function.
• (Montúfar et al. 2014) - MLP with $$L$$ hidden layers with $$n$$ inputs and width $$m > n$$ can compute functions with $$\Omega((m/n)^{(L-1)n}m^n)$$ linear regions.

### Lecture 5: Recurrent Neural Networks

We start by the motivation that in some tasks the inputs can be sequences of different sizes. One such task is sentiment analysis, where we want to classify sentences from some natural language into $$C=\{1,\cdots,c\}$$ with 1 being the worst sentiment.

A simple, naive way to process such sequences is via a for loop. For e.g. take the problem of counting the number of zeros in some sequence $$(x_{1_n},...,x_{T_n})$$ where $$T_n$$ denotes the length of the $$n^{th}$$ sample. A simple for loop for this problem is:

h = 0
for x_i in x:
if x_i==0:
h = h+1


The computation graph for this would involve a function $$f$$ being applied sequentially, and at each step $$i$$, the corresponding input $$x_i$$ is also part of the input in addition to the output from the previous layer. After taking these inputs, $$f$$ returns the output from the previous layer, which counts the number of 0s until $$x_{i-1}$$ then increments the value by 1 if $$x_i=0$$.

For more complex problems, we can still have some function $$f$$ applied in a similar manner, however $$f$$ is now a function we want to learn from data. This gives rise to the vanilla Recurrent Neural Network (RNN), where we have

$y = \text{Softmax}(\mathbf{f}(\mathbf{h}_{N-1}, \mathbf{x}_N))$ $\mathbf{h}_{N-1}=\mathbf{f}(\mathbf{h}_{N-2},\mathbf{x}_{N-1})$

Where $$\mathbf{f}(\mathbf{h},\mathbf{x})=\sigma(\mathbf{W}\mathbf{h}+\mathbf{U}\mathbf{x}+\mathbf{b})$$ for some non-linearity $$\sigma$$, the $$\mathbf{h}_i$$ are called hidden states, and $$\mathbf{h}_0$$ is predefined for e.g. is a sample from some distribution. Here the softmax is put at the end assuming that we want to classify sequences.

Compared to a standard feedforward network which starts with the input and every layer has its own learnable parameters, an RNN some hidden state $$\mathbf{h}_0$$ with parts of the input being applied sequentially, and the learnable parameters are shared throughout the network. (TODO Are there networks where all of the input is applied at every step in the RNN? I suppose not since otherwise $$f$$ cannot be some fixed function to be learnt.)

Training RNNs as described above is similar to how we trained feedforward networks - we can use for e.g. the cross-entropy loss.

Backpropagation in RNNs can be thought of as first assuming each layer has its own parameters and backpropagating, then using those gradients to compute the true gradient wrt shared parameters. Since the paramaters we assume exist for each layer are the shared parameters for the whole network, we can by the chain rule get that we have to sum over the gradients at each iteration of the sequence. Another way to think of this is that there is a function that takes the global shared parameters and generates parameters for each layer - here it is the identity function and we backpropagate through that. This is called backpropagation through time. (TODO Where do the parameters for each layer alone come from?)

To see how recurrence might not be the best idea, consider a vanilla RNN starting with hidden state $$\mathbf{0}$$ with no bias term i.e $$\mathbf{b}=\mathbf{0}$$. At iteration $$t$$ the hidden state is then $$\mathbf{h}_t = \sum_{\tau=1}^t \mathbf{W}^{t-\tau}\mathbf{U}\mathbf{x}_\tau$$. Assuming $$\mathbf{W}$$ is diagonalizable, $$\mathbf{W}^{t-\tau} = \mathbf{Q}\Lambda^{t-\tau}\mathbf{z}||^2 = ||\Lambda^{t-\tau}\mathbf{z}||^2 = \sum_i (\lambda_i^{t-\tau}z_i)^2$$. Here we use the orthogonality of $$\mathbf{Q}$$, and we see that if even one eigenvalue $$\lambda_i$$ of $$\mathbf{W}$$ is such that $$\lambda_i>1$$ then the norm explodes and we will have very large numbers in forwrd computations.

The largest absolute value among the eigenvalues of $$\mathbf{W}$$ is called the spectral radius. One solution to this is to use the $$\tanh$$ non-linearity which bounds the outputs to be in $$(-1,1)$$.

Audience question: When using mini-batches, what happens if we have sequences of different length in these mini-batches? Lecturer answer: One method is to create a tensor of the same-size, then mark the index at the end of each sequence in the batch so that it discards the values in the tensors which come after that index.

But using $$\tanh$$ introduces another problem, namely the gradients are now $\frac{\partial \mathcal{L}}{\partial \mathbf{h}_1}^T = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_t}^T \prod_{\tau=t,\cdots,2} \frac{\partial \mathbf{h}_\tau}{\partial \mathbf{h}_{\tau-1}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_t}^T \prod_{\tau=1,\cdots,2}\text{diag}(\phi_\tau')\mathbf{W}$ Where the LHS is a column vector of partial derivatives of the loss with respect to the hidden state $$\mathbf{h}_1$$ and $$\phi_\tau' = \phi'(\mathbf{W}\mathbf{h}_{\tau-1}+\mathbf{U}\mathbf{x}_t+\mathbf{b})$$ is the derivative of the non-linearity (here, $$\tanh$$). Assuming that the neurons are not saturated which means $$\phi(h_i)\approx 0$$, then there is some $$\gamma$$ such that $$|\phi'_\tau|\geq \gamma$$ and if the spectral radius of $$\mathbf{W}$$ is greater than $$\frac{1}{\gamma}$$ then the gradient explodes. From this we can conclude that we would want the neuons to be saturated. One solution to this problem is gradient clipping, where if the norm of the gradient is greater than some predefined threshold, we normalize it to have the same magnitude as this threshold.

However, if we manage to have that $$0 < |\phi'_\tau| \leq 1$$, and the spectral radius of $$\mathbf{W}$$ is less than 1, the gradient will decay exponentially in the size of $$t$$, this is called the vanishing gradient phenomenon. Thus here, we would want the neurons to be non-saturated, which conflicts with the objective we want to prevent exploding gradients above. When we have vanishing gradients, it becomes harder to learn long-range dependencies.

In practice, recurrent units with gating mechanisms work better, for e.g. Long Short-Term Memory (LSTM) and Gating Recurrent Units (GRU).

The motivation for GRUs comes from the fact that we want to keep old values from the hidden state instead of using all of it when performing the forward pass. GRUs implement this by using an additional update gate $$\mathbf{u}_t \in (0,1)$$ that controls which states should be updated. $\mathbf{h}_t = (1 - \mathbf{u}_t) \odot \mathbf{h}_t + \mathbf{u}_t \odot \tilde{\mathbf{h}}_t$ $\mathbf{u}_t = \sigma(\mathbf{W}_u\mathbf{h}_{t-1} + \mathbf{U}_u\mathbf{x}_t + \mathbf{b}_u)$

Where $$\sigma$$ is the sigmoid function, $$\tilde{\mathbf{h}}_t$$ is the new state candidates, which are computed using only the states select by a reset gate $$\mathbf{r}_t$$, giving: $\tilde{\mathbf{h}}_t = \phi(\mathbf{W}(\mathbf{r}_t \odot \mathbf{h}_{t-1} + \mathbf{U}\mathbf{x}_t + \mathbf{b}_h)$ $\mathbf{r}_t = \sigma(\mathbf{W}_r \mathbf{h}_{t-1} + \mathbf{U}_r \mathbf{x}_t + \mathbf{b}_r)$

Writing the gradients wrt to the hidden state, we see that for the GRU, the gradients decay with a rate $$\frac{1}{2}$$ rather than $$\gamma$$ as in RNNs, and if $$\gamma$$ is large, the gradient magnitude grows exponentially as $$\mathcal{O}(\Big( \frac{\gamma}{4}\Big)^t)$$ compared to $$\mathcal{O}(\gamma^t)$$ in RNNs.

Consider the Linear Gaussian model with temporal structure: $\mathbb{P}(\mathbf{h}_1) = \mathcal{N}(\mathbf{h}_1|\mathbf{\mu}_1,\mathbf{R}_1)$ $\mathbb{P}(\mathbf{h}_t|\mathbf{h}_{t-1}) = \mathcal{N}(\mathbf{h}_t|\mathbf{B}\mathbf{h}_{t-1},\mathbf{R})$ $\mathbb{P}(\mathbf{x}_t|\mathbf{h}_t)=\mathcal{N}(\mathbf{x}_t|\mathbf{A}\mathbf{h}_t,\mathbf{V})$

When finding the conditional distribution $$\mathbb{P}(\mathbf{h}_t|\mathbf{x}_{1:t})$$, using a message passing algorithm yields the Kalman filter, which has 2 steps:

• Prediction: $$\mathbb{P}(\mathbf{h}_t|\mathbf{x}_{1:t-1})=\mathcal{N}(\mathbf{h}_t|\mathbf{h}_t^*,\mathbf{P}_t)$$ where $$\mathbf{h}_t* = \mathbf{B}\mathbf{h}_{t-1}^{**}$$ and $$\mathbf{P}_t = \mathbf{B}\mathbf{\Sigma}_{t-1}\mathbf{B}^T + \mathbf{R}$$
• Correction: $$\mathbb{P}(\mathbf{h}_t|\mathbf{x}_{1:t}) = \mathcal{N}(\mathbf{h}_t|\mathbf{h}_t^{**},\mathbf{\Sigma}_t)$$ where $$\mathbf{h}_t^{****} = \mathbf{h}_t^* + \mathbf{K}_t(\mathbf{x}_t - \mathbf{A}\mathbf{h}_t^{**})$$, $$\mathbf{\Sigma}_t = (\mathbf{I} - \mathbf{K}_t\mathbf{A})\mathbf{P}_{t-1}$$ and $$\mathbf{K}_t =\mathbf{P}_{t-1}\mathbf{A}^T(\mathbf{A}\mathbf{P}_{t-1}\mathbf{A}^T + \mathbf{V})^{-1}$$.

In the one-dimensional case, we can see the corrected value of the hidden state can be rewrittena as, $$h^{**}_t = (1-u_t)h^*_t + u_t\frac{x_t}{a}$$ where $$u_t$$ is a sigmoid of the form $$u_t = \sigma(\log \frac{a^2 p_{t-1}}{a^2 p_{t-1}+v})$$. Hence comparing the update rules for the hidden state in the GRU, we see that they are quite similar, and somewhat gives clarity, if not justification to the gating mechanism which seems a bit arbitrary in the beginning.

LSTMs on the other hand introduce a new vector of states $$\mathbf{c}_t$$, thought of as a private attribute for an LSTM cell which is updated as: $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_{t-1}\odot \phi_c (\mathbf{W}_c\mathbf{h}_{t-1} + \mathbf{U}_c\mathbf{x}_t + \mathbf{b}_c)$ with the forget gate $$\mathbf{f}_t =\sigma(\mathbf{W}_f\mathbf{h}_{t-1} +\mathbf{U}_f\mathbf{x}_t +\mathbf{b}_f)$$ and the input gate $$\mathbf{i}_t = \sigma (\mathbf{W}_i\mathbf{h}_{t-1} + \mathbf{U}_i\mathbf{x}_t + \mathbf{b}_i)\in (0,1)$$.

The output of the LSTM cell, here called the obsered state is then: $\mathbf{h}_t = \mathbf{o}_t \odot \phi_h(\mathbf{c}_t)$ where the output gate is $$\mathbf{o}_t = \sigma(\mathbf{W}_o\mathbf{h}_{t-1} + \mathbf{U}_o\mathbf{x}_t + \mathbf{b}_o)$$.

The gradients of this state is then $$\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \text{diag}({\mathbf{f}_t})$$, and hence if we have $$\mathbf{f}_t=1$$ this gradient neither vanishes nor explodes.

When initializing weights for LSTMs, it is common pracitce to have small random weights for $$\mathbf{b}_f$$ which sets the forget gate to around $$\frac{1}{2}$$ (TODO: Why?) - But setting these with large values means the forget gates are closer to 1 and this encourages gradients to flow better back through time, allowing us to learn long-range dependencies.

Recurrent units are commonly used in sequence-to-sequence models.

### Lecture 6: Attention-based models

Consider the sequence-to-sequence model based on an encoder-decoder framework, where we encode the whole input sequence into a single state (which is sometime called the “context”), and a decoder which uses the context alongside a transformed version of it.

Suppose instead of a single state, we want to encode the input sequence to a representation which varies with length, i.e. longer input sequences lead to longer representations of its encoding. A first attempt at this may involve generating $$k$$ states for $$k$$ inputs, where the intermediate representation evolves based on the immediately previous intermediate representation. (E.g. $$x_1,...x_k$$ generates the representations $$z_1,...,z_k$$, with the intermediate representations $$h_1,...,h_k$$ such that $$h_i$$ depends only on $$h_{i-1}$$). This however does not work well in practice since the intermediate representations only get information from their preceeding values, meanwhile it is possible that values at successive indices provide have information that is relevant. One solution is to use a bi-directional RNN for the intermediate representations (Bahdanau et al. 2014), so the intermediate representation $$h_i$$ is now a concatenation of a chain on intermediate representations, one which goes forward in the sequence order and the other going backward.

Now we would want to pass the varying length input representation to the decoder. This is done via an attention block. The attention block takes a varying length input representation and produces a context of fixed size for each iteration when decoding the sequence. This attention block, which we have not defined yet, can be thought of as a switch which selects parts of the input representation that is relevant to the current iteration in the decoding sequence. The attention block can take as input the previous hidden state from the decoding sequence as well. Mathematically, at iteration $$i$$ of the decoding sequence, we could have the output $$\mathbf{c}$$ of the attention block as: $\mathbf{c} = \sum_{j} \alpha_j \mathbf{z}_j$ where $$\alpha_j = \frac{\exp(e_j)}{\sum_{j'}\exp(e_{j'})}}$$ with $$e_j = \text{MLP}(\mathbf{h}_{i-1},\mathbf{z}_j)$$ where $$\{\mathbf{z}_j\}_{j=1}^J$$ are the varying length input representations. (#ASK So how does the size of the attention blocks change with different inputs).

(#ASK In the lecture around 21:40 sounds like he implies the MLP is difference for each input from the encoder representation, but the MLP has no index, so I assume its the same MLP?)

Attention blocks have also been used in image captioning tasks, see the paper “Show, attend and tell”.

We now look at the intermediate representation, namely the sequential nature of the bi-directional RNN means that training could be slow since this step of the encoding process depends on the size of the input. Gehring et al 2017 adopt a CNN instead of a bi-directional RNN for this step. The CNN can do the computations in parallel, however it cannot take into account whether a position is at the begininng or end of the sequence. To first this problem, Gehring et al 2017 simply include this information by adding a position embedding value $$\mathbf{p}_j$$ to each input word embedding $$\mathbf{x}_j$$. (Note that here they add the positional embedding to the input embedding instead of concatenating). In the decorder, we can get rid of RNNs by using a CNN of the form $$\mathbf{y}_i = \text{CNN}(\mathbf{y}_{i-1},\cdots,\mathbf{y}_i,\mathbf{z}_1,\cdots,\mathbf{z}_n)$$ (an Auto-regressive model) where the receptive field of each $$\mathbf{y}_i$$ does not include any subsequent elements $$\mathbf{y}_j \: : j\geq i$$ - this can be done using shifted convolutions. Shifted convolutions produce outputs aligned with the input, where padding is added to the beginning of the edges instead of all edges, they are also called auto-convolutions. 1D convolutions are used since here we process sequences. Gehring et al 2017 went one step further and added an attention block to the decoder as well, which finally produces the output sequence $$\mathbf{o}_i$$ such that: $\mathbf{o}_i = \sum_j \alpha_{ij}(\mathbf{z}_j+\mathbf{x}_j+\mathbf{p}_j)$ Where $$\alpha_{ij}=\frac{\exp(\mathbf{h}_i^T\mathbf{z}_j)}{\sum_{j'=1}^n \exp(\mathbf{h}_i^T\mathbf{z}_{j'})}$$. (#TODO Check how this attention mechanism works, there was a typo in slide 18).

Gehring et al 2017 in fact used multiple decoder blocks, skip connections and much more.

Transformers were introduced by Vaswani et al 2017. The general architecture of a transformer is very similar to the convolutional sequence-to-sequence model from Gehring et al 2017, to recall, which had an encoder architecture that takes an input sequence $$\mathbf{x}_1,\cdots,\mathbf{x}_n$$, generates from them the contexts $$\mathbf{z}_1,\cdots,\mathbf{z}_n$$, which are passed onto a decoder which uses an attention mechanism to produce the final sequence $$\mathbf{y}_1,\cdots,\mathbf{y}_m$$.

The transformer network has a different attention mechanism which they call the scaled dot-product attention, where the output sequence only uses $$\mathbf{z}_i$$’s without the input embedding and positional encoding values, and the weights are computed using the formula: $\alpha_{ij} = \frac{\exp(\mathbf{z}_j^T\mathbf{h}_i/\sqrt{d_k})}{\sum_{j'=1}^n \exp(\mathbf{z}_{j'}^T\mathbf{h}_i/\sqrt{d_k})}$ Where $$d_k$$ is the dimensionality of $$\mathbf{z}_j$$ and $$\mathbf{h}_i$$. The authors suggest the following interpretation for this attention block, as a layer that finds values $$\mathbf{v}_j=\mathbf{z}_j$$ with keys $$\mathbf{k}_j=\mathbf{z}_j$$ closest to the query $$\mathbf{q}_i=\mathbf{h}_i$$. Re-writing the scaled dot-product attention with this interpretation we have: $\mathbf{o}_i =\sum_j \alpha_{ij}\mathbf{v}_j$ Where $$\alpha_{ij} = \frac{\exp(\mathbf{k}_j^T\mathbf{q}_i/\sqrt{d_k})}{\sum_{j'=1}^n\exp(\mathbf{k}_{j'}^T\mathbf{q}_i/\sqrt{d_k})}$$.

In Vaswani et al 2017, the authors in fact use multi-headed attention, whereby keys, queries and values are first projected into lower-dimensional spaces and the scaled dot-product attention is performed there, with the outputs being concatenated. Thus we have for $$\mathbf{V}\in \mathbb{R}^{m\times d_v}, \mathbf{Q}\in\mathbb{R}^{n\times d_k}, \mathbf{K}\in\mathbb{R}^{m\times d_k}$$ the multi-headed attention block is $$\text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1,\cdots,\text{head}_h)W^O \in \mathbb{R}^{n\times d_k}$$ and $$\text{head}_i = \text{Attention}(\mathbf{Q}W_i^Q, \mathbf{K}W_i^K, \mathbf{V}W_i^V) \in \mathbb{R}^{n\times d_i}$$.

For the encoder block, they use the same multi-headed attention mechanism to convert the inputs $$\mathbf{x}_1,\cdots,\mathbf{x}_n$$ into $$\mathbf{z}_1,\cdots,\mathbf{z}_n$$, where the vectors $$\mathbf{x}_i$$ are used as keys, values and queries - this is called self-attention. The first position here affects the representation in the last position and vice-versa after just one layer, in comparison to CNNs and variants of RNNs used previously. In this form, when we shuffle the inputs of the self-attention block, the resulting output is simply the same permutation i.e. this block is permutation-equivariant. To combat this they use a hard-coded positional embedding which for element $$i$$ of the encoding is a sine of cosine of $$\frac{p}{10000^{2i/d}$$ depending on whether $$i$$ is even or odd respectively. Here $$p$$ is the position, and the encoding itself has dimensionality $$d$$ which is the same as the input/output embeddings. In this encoder block they then use skip connections, and an MLP as well, and all of these can be stacked on top.

For the decoder block, which must implement an auto-regressive model, they propose to use a so-called masked self-attention block when converting the outputs $$\mathbf{y}_i$$ being generated to queries $$\mathbf{q}_i$$ which can be implemented as: $\mathbf{h}_i = \sum_j \alpha_{ij}\mathbf{v}_j$ Where $\alpha_{ij} = \frac{\exp(\mathbf{v}_j^T\mathbf{v}_i/\sqrt{d_k} + m_{ij})}{\sum_{j'=1}^n \exp(\mathbf{z}_{j'}^T\mathbf{h}_i/\sqrt{d_k} + m_{ij'})}$ With the masks $$m_{ij}$$ such that $$m_{ij}=\infty$$ if $$j>i$$ else 0, thus $$\alpha_{ij}=0$$ for $$j>i$$ meaning $$\mathbf{v}_{i+1},\cdots,\mathbf{v}_m$$ are not used when computing $$\mathbf{h}_i$$. The overall decoder thus uses a combination of the masked multi-headed attention, skip connections, multi-headed attention and MLPs, with the final decoder being stacked as needed.

Training transformers often use a ramp-up learning schedule, with a linearly growing learning rate followed by decaying learning rate after some predefined iteration number.

One of the famous models that make use of these ideas is BERT from Devlin et al 2018, which is often used a pre-trained model which can be later fine-tuned for particular tasks. This model is essentially a transformer encoder, which takes in a single or pair of sentences (pairs are important for downstream tasks such as question answering) and outputs a single or pair of sentences respectively, where both the input and output have the same size. One pretraining task for BERT is the following: mask a token to corrupt the input sequence and reconstruct this masked token. Another pretraining task is to take 2 sentences and predict (binary output) whether the second sentence follows from the first sentence. After these pretraining tasks, one can now finetune the model for the specific task.

## References

Cybenko, George. 1989. “Approximation by Superpositions of a Sigmoidal Function.” Mathematics of Control, Signals and Systems 2 (4). Springer: 303–14.
Montúfar, Guido, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. “On the Number of Linear Regions of Deep Neural Networks.” In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, 2924–32. Nips’14. Montreal, Canada: MIT Press.
Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. 2013. “On the Difficulty of Training Recurrent Neural Networks.” In International Conference on Machine Learning, 1310–18. PMLR.

Created: 2022-04-04 Mon 23:39

Validate