Summary

Recurrent Neural Networks (RNNs) are designed to process sequential data by handling elements one at a time while maintaining a memory (hidden state) of previous inputs. Used for language, time-series data, audio, etc.

Key components

  • Input vector x_t: the data point at the current time step.
  • Hidden state h_t: a vector that represents the RNN's memory, updated at each time step.
  • Output vector y_t: the output at each time step, which may depend on the hidden state h_t.

Updating the hidden state h_t

At each time step t, the RNN takes the current input x_t, combines it with the previous hidden state h_{t-1}, produces a new hidden state h_t, and optionally produces an output y_t. The update is:

h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)

  • W_xh = weight matrix that connects the input to the hidden layer
  • W_hh = weight matrix that connects the previous hidden state to the current hidden state
  • b_h = bias term for the hidden layer
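
A minimal sketch of this update with explicit tensors (the sizes, variable names, and random values below are illustrative only):

import torch

# Illustrative sizes; the weights are random stand-ins
input_size, hidden_size = 5, 4
Wxh = torch.randn(hidden_size, input_size)   # W_xh: input-to-hidden weights
Whh = torch.randn(hidden_size, hidden_size)  # W_hh: hidden-to-hidden weights
b_h = torch.zeros(hidden_size)               # hidden bias

x_t = torch.randn(input_size)      # current input
h_prev = torch.zeros(hidden_size)  # previous hidden state h_{t-1}

# h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
h_t = torch.tanh(Wxh @ x_t + Whh @ h_prev + b_h)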

Generating the output y_t

y_t = W_hy h_t + b_y

  • W_hy = weight matrix for the output layer
  • b_y = bias for the output layer
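
A matching sketch for the output step (again, sizes are illustrative and a random vector stands in for h_t):

import torch

hidden_size, output_size = 4, 2
Why = torch.randn(output_size, hidden_size)  # W_hy: hidden-to-output weights
b_y = torch.zeros(output_size)               # output bias

h_t = torch.randn(hidden_size)  # hidden state produced by the update step

# y_t = W_hy h_t + b_y
y_t = Why @ h_t + b_y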

Backpropagation through time

In training an RNN, you need to minimize the total loss accumulated over all time steps. BPTT does this as follows:

  1. Unroll the RNN through time (turning it into a deep feedforward network where each layer corresponds to a different time step)

  2. Apply standard backprop through the unrolled structure

  3. Accumulate gradients for the shared weights at each time step (see the sketch after this list)

  4. Update parameters
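
A small sketch of these four steps on a toy sequence (sizes and the per-step loss are arbitrary; PyTorch's autograd performs steps 2-3 when loss.backward() is called):

import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes for this sketch
T, input_size, hidden_size = 3, 5, 4
Wxh = nn.Linear(input_size, hidden_size)   # shared across all time steps
Whh = nn.Linear(hidden_size, hidden_size)  # shared across all time steps

x_seq = torch.randn(T, input_size)
h_t = torch.zeros(hidden_size)

# Step 1: unroll through time (one "layer" per time step)
loss = 0.0
for t in range(T):
    h_t = torch.tanh(Wxh(x_seq[t]) + Whh(h_t))
    loss = loss + h_t.pow(2).sum()  # toy per-step loss, summed over the sequence

# Steps 2-3: backprop through the unrolled graph; the shared weights receive
# gradient contributions accumulated from every time step
loss.backward()
print(Whh.weight.grad.shape)  # torch.Size([4, 4])

# Step 4: update the parameters (e.g. optimizer.step(), not shown here)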

Forward pass

The inputs x_1, ..., x_T generate hidden states h_1, ..., h_T and outputs y_1, ..., y_T. At each time step:

h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
y_t = W_hy h_t + b_y
L_t = loss(y_t, target_t)

For a sequence of length T, the total loss is L = Σ_{t=1}^{T} L_t.

Backward pass

During the backward pass, we compute gradients of the total loss with respect to all trainable parameters: W_xh, W_hh, and W_hy.

First, compute the output gradient at each time step:

dy_t = ∂L_t/∂y_t

Then compute gradients flowing into the hidden states:

∂L/∂h_t = W_hy^T dy_t + W_hh^T da_{t+1}

  • where da_t = (∂L/∂h_t) ⊙ (1 - h_t^2) (from the tanh derivative), with da_{T+1} = 0

For each time step t, accumulate parameter gradients:

∂L/∂W_hy += dy_t h_t^T
∂L/∂W_xh += da_t x_t^T
∂L/∂W_hh += da_t h_{t-1}^T
∂L/∂b_y += dy_t
∂L/∂b_h += da_t

The gradients are summed across all time steps because the same weights are shared across the whole sequence.
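
A manual sketch of the forward and backward passes that mirrors the equations above (illustrative sizes, a simple squared-error loss instead of cross-entropy; in practice autograd does all of this, as in the full PyTorch example below):

import torch

torch.manual_seed(0)

# Illustrative sizes and random data for this sketch
T, input_size, hidden_size, output_size = 4, 3, 5, 2
Wxh = torch.randn(hidden_size, input_size) * 0.1
Whh = torch.randn(hidden_size, hidden_size) * 0.1
Why = torch.randn(output_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)
b_y = torch.zeros(output_size)

xs = [torch.randn(input_size) for _ in range(T)]
targets = [torch.randn(output_size) for _ in range(T)]

# Forward pass: keep every hidden state and output for reuse in the backward pass
hs = {-1: torch.zeros(hidden_size)}
ys = {}
for t in range(T):
    hs[t] = torch.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + b_h)
    ys[t] = Why @ hs[t] + b_y

# Backward pass (BPTT) with L_t = 0.5 * ||y_t - target_t||^2, so dy_t = y_t - target_t
dWxh, dWhh, dWhy = torch.zeros_like(Wxh), torch.zeros_like(Whh), torch.zeros_like(Why)
db_h, db_y = torch.zeros_like(b_h), torch.zeros_like(b_y)
da_next = torch.zeros(hidden_size)  # gradient arriving from step t+1 (zero at t = T)

for t in reversed(range(T)):
    dy = ys[t] - targets[t]              # output gradient dL_t/dy_t
    dWhy += torch.outer(dy, hs[t])
    db_y += dy
    dh = Why.T @ dy + Whh.T @ da_next    # gradient flowing into h_t
    da = dh * (1 - hs[t] ** 2)           # tanh derivative
    dWxh += torch.outer(da, xs[t])
    dWhh += torch.outer(da, hs[t - 1])
    db_h += da
    da_next = da

print(dWhh.norm().item())  # gradients are sums of contributions from every time step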

Optionally, apply gradient clipping to avoid exploding gradients:

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Challenges

  • Exploding/Vanishing Gradients: gradients can shrink toward zero or grow uncontrollably as they are backpropagated through many time steps (see the sketch after this list)
  • Limited memory: vanilla RNNs struggle to retain patterns longer than roughly 10-20 time steps
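
A toy illustration of the gradient issue (the sizes and weight scale here are arbitrary): with small recurrent weights, the gradient of a late hidden state with respect to the initial one tends to shrink rapidly as the sequence grows, and with large weights it can grow instead.

import torch

torch.manual_seed(0)

# How strongly does h_T depend on h_0?
hidden_size = 8
Whh = torch.randn(hidden_size, hidden_size) * 0.1  # small recurrent weights

for T in (5, 20, 50):
    h0 = torch.randn(hidden_size, requires_grad=True)
    h_t = h0
    for _ in range(T):
        h_t = torch.tanh(Whh @ h_t)
    grad = torch.autograd.grad(h_t.sum(), h0)[0]
    print(T, grad.norm().item())  # the norm typically shrinks sharply as T grows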

Specialized RNNs
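
Gated variants such as the LSTM and GRU address these issues with learned gates that control what is written to and kept in memory. A minimal usage sketch with PyTorch's built-in modules (sizes are illustrative):

import torch
import torch.nn as nn

# Illustrative sizes; nn.GRU(5, 4) would be a drop-in alternative
lstm = nn.LSTM(input_size=5, hidden_size=4)

x_seq = torch.randn(6, 1, 5)    # (seq_len, batch, input_size)
out, (h_n, c_n) = lstm(x_seq)   # out contains the hidden state at every step
print(out.shape, h_n.shape)     # torch.Size([6, 1, 4]) torch.Size([1, 1, 4])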

PyTorch implementation

import torch
import torch.nn as nn
import torch.optim as optim
 
# Set seed for reproducibility
torch.manual_seed(42)
 
# Hyperparameters
input_size = 5    # dimension of x_t
hidden_size = 4   # dimension of h_t
output_size = 2   # dimension of y_t
seq_len = 6       # sequence length
batch_size = 1    
lr = 0.01
 
# Sample input and target
x_seq = torch.randn(seq_len, batch_size, input_size)  # shape: (T, B, D)
target_seq = torch.randint(0, output_size, (seq_len, batch_size))  # classification targets
 
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.Wxh = nn.Linear(input_size, hidden_size)   # W_xh: input-to-hidden
        self.Whh = nn.Linear(hidden_size, hidden_size)  # W_hh: hidden-to-hidden
        self.Why = nn.Linear(hidden_size, output_size)  # W_hy: hidden-to-output
        self.tanh = torch.tanh
 
    def forward(self, x_seq):
        T, B, _ = x_seq.size()
        h_t = torch.zeros(B, self.hidden_size, device=x_seq.device)  # initialize hidden state
        outputs = []
 
        for t in range(T):
            x_t = x_seq[t]
            h_t = self.tanh(self.Wxh(x_t) + self.Whh(h_t))
            y_t = self.Why(h_t)
            outputs.append(y_t)
 
        return torch.stack(outputs)  # shape: (T, B, output_size)
 
# Instantiate model and optimizer
model = RNN(input_size, hidden_size, output_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
 
# Training step
model.train()
optimizer.zero_grad()
 
# Forward pass
y_preds = model(x_seq)  # shape: (T, B, output_size)
 
# Compute loss
loss = 0
for t in range(seq_len):
    loss += criterion(y_preds[t], target_seq[t])
 
# Backward pass (BPTT)
loss.backward()
 
# Gradient clipping (optional)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
 
# Parameter update
optimizer.step()
 
print(f"Loss: {loss.item():.4f}")