Summary

Recurrent Neural Networks (RNNs) are designed to process sequential data by handling elements one at a time while maintaining a memory (hidden state) of previous inputs. Used for language, time-series data, audio, etc.

Key components

  • Input vector x_t: the data point at the current time step.
  • Hidden state h_t: a vector that represents the RNN's memory, updated at each time step.
  • Output vector y_t: the output at each time step, which may depend on the hidden state h_t.

Updating the hidden state h_t

At each time step t, the RNN takes the current input x_t, combines it with the previous hidden state h_{t-1}, produces a new hidden state h_t, and optionally produces an output y_t. The update is:

h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)

  • W_xh = weight matrix that connects the input to the hidden layer
  • W_hh = weight matrix that connects the previous hidden state to the current hidden state
  • b_h = bias term for the hidden layer
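
A minimal sketch of this update with explicit tensors (the sizes, variable names, and random values below are illustrative only):

import torch

# Illustrative sizes; the weights are random stand-ins
input_size, hidden_size = 5, 4
Wxh = torch.randn(hidden_size, input_size)   # W_xh: input-to-hidden weights
Whh = torch.randn(hidden_size, hidden_size)  # W_hh: hidden-to-hidden weights
b_h = torch.zeros(hidden_size)               # hidden bias

x_t = torch.randn(input_size)      # current input
h_prev = torch.zeros(hidden_size)  # previous hidden state h_{t-1}

# h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
h_t = torch.tanh(Wxh @ x_t + Whh @ h_prev + b_h)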

Generating the output y_t

y_t = W_hy h_t + b_y

  • W_hy = weight matrix for the output layer
  • b_y = bias for the output layer
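
A matching sketch for the output step (again, sizes are illustrative and a random vector stands in for h_t):

import torch

hidden_size, output_size = 4, 2
Why = torch.randn(output_size, hidden_size)  # W_hy: hidden-to-output weights
b_y = torch.zeros(output_size)               # output bias

h_t = torch.randn(hidden_size)  # hidden state produced by the update step

# y_t = W_hy h_t + b_y
y_t = Why @ h_t + b_y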

Backpropagation through time

In training an RNN, you need to minimize the total loss accumulated over all time steps. BPTT does this as follows:

  1. Unroll the RNN through time (turning it into a deep feedforward network where each layer corresponds to a different time step)

  2. Apply standard backprop through the unrolled structure

  3. Accumulate gradients for the shared weights at each time step (see the sketch after this list)

  4. Update parameters
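
A small sketch of these four steps on a toy sequence (sizes and the per-step loss are arbitrary; PyTorch's autograd performs steps 2-3 when loss.backward() is called):

import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes for this sketch
T, input_size, hidden_size = 3, 5, 4
Wxh = nn.Linear(input_size, hidden_size)   # shared across all time steps
Whh = nn.Linear(hidden_size, hidden_size)  # shared across all time steps

x_seq = torch.randn(T, input_size)
h_t = torch.zeros(hidden_size)

# Step 1: unroll through time (one "layer" per time step)
loss = 0.0
for t in range(T):
    h_t = torch.tanh(Wxh(x_seq[t]) + Whh(h_t))
    loss = loss + h_t.pow(2).sum()  # toy per-step loss, summed over the sequence

# Steps 2-3: backprop through the unrolled graph; the shared weights receive
# gradient contributions accumulated from every time step
loss.backward()
print(Whh.weight.grad.shape)  # torch.Size([4, 4])

# Step 4: update the parameters (e.g. optimizer.step(), not shown here)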

Forward pass

The inputs x_1, ..., x_T generate hidden states h_1, ..., h_T and outputs y_1, ..., y_T. At each time step:

h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
y_t = W_hy h_t + b_y
L_t = loss(y_t, target_t)

For a sequence of length T, the total loss is L = Σ_{t=1}^{T} L_t.

Backward pass

During the backward pass, we compute gradients of the total loss with respect to all trainable parameters: W_xh, W_hh, and W_hy.

First, compute the output gradient at each time step:

dy_t = ∂L_t/∂y_t

Then compute gradients flowing into the hidden states:

∂L/∂h_t = W_hy^T dy_t + W_hh^T da_{t+1}

  • where da_t = (∂L/∂h_t) ⊙ (1 - h_t^2) (from the tanh derivative), with da_{T+1} = 0

For each time step t, accumulate parameter gradients:

∂L/∂W_hy += dy_t h_t^T
∂L/∂W_xh += da_t x_t^T
∂L/∂W_hh += da_t h_{t-1}^T
∂L/∂b_y += dy_t
∂L/∂b_h += da_t

The gradients are summed across all time steps because the same weights are shared across the whole sequence.
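
A manual sketch of the forward and backward passes that mirrors the equations above (illustrative sizes, a simple squared-error loss instead of cross-entropy; in practice autograd does all of this, as in the full PyTorch example below):

import torch

torch.manual_seed(0)

# Illustrative sizes and random data for this sketch
T, input_size, hidden_size, output_size = 4, 3, 5, 2
Wxh = torch.randn(hidden_size, input_size) * 0.1
Whh = torch.randn(hidden_size, hidden_size) * 0.1
Why = torch.randn(output_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)
b_y = torch.zeros(output_size)

xs = [torch.randn(input_size) for _ in range(T)]
targets = [torch.randn(output_size) for _ in range(T)]

# Forward pass: keep every hidden state and output for reuse in the backward pass
hs = {-1: torch.zeros(hidden_size)}
ys = {}
for t in range(T):
    hs[t] = torch.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + b_h)
    ys[t] = Why @ hs[t] + b_y

# Backward pass (BPTT) with L_t = 0.5 * ||y_t - target_t||^2, so dy_t = y_t - target_t
dWxh, dWhh, dWhy = torch.zeros_like(Wxh), torch.zeros_like(Whh), torch.zeros_like(Why)
db_h, db_y = torch.zeros_like(b_h), torch.zeros_like(b_y)
da_next = torch.zeros(hidden_size)  # gradient arriving from step t+1 (zero at t = T)

for t in reversed(range(T)):
    dy = ys[t] - targets[t]              # output gradient dL_t/dy_t
    dWhy += torch.outer(dy, hs[t])
    db_y += dy
    dh = Why.T @ dy + Whh.T @ da_next    # gradient flowing into h_t
    da = dh * (1 - hs[t] ** 2)           # tanh derivative
    dWxh += torch.outer(da, xs[t])
    dWhh += torch.outer(da, hs[t - 1])
    db_h += da
    da_next = da

print(dWhh.norm().item())  # gradients are sums of contributions from every time step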

Optionally, apply gradient clipping to avoid exploding gradients:

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Challenges

  • Exploding/Vanishing Gradients: gradients can shrink toward zero or grow uncontrollably as they are backpropagated through many time steps (see the sketch after this list)
  • Limited memory: vanilla RNNs struggle to retain patterns longer than roughly 10-20 time steps
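
A toy illustration of the gradient issue (the sizes and weight scale here are arbitrary): with small recurrent weights, the gradient of a late hidden state with respect to the initial one tends to shrink rapidly as the sequence grows, and with large weights it can grow instead.

import torch

torch.manual_seed(0)

# How strongly does h_T depend on h_0?
hidden_size = 8
Whh = torch.randn(hidden_size, hidden_size) * 0.1  # small recurrent weights

for T in (5, 20, 50):
    h0 = torch.randn(hidden_size, requires_grad=True)
    h_t = h0
    for _ in range(T):
        h_t = torch.tanh(Whh @ h_t)
    grad = torch.autograd.grad(h_t.sum(), h0)[0]
    print(T, grad.norm().item())  # the norm typically shrinks sharply as T grows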

Specialized RNNs
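
Gated variants such as the LSTM and GRU address these issues with learned gates that control what is written to and kept in memory. A minimal usage sketch with PyTorch's built-in modules (sizes are illustrative):

import torch
import torch.nn as nn

# Illustrative sizes; nn.GRU(5, 4) would be a drop-in alternative
lstm = nn.LSTM(input_size=5, hidden_size=4)

x_seq = torch.randn(6, 1, 5)    # (seq_len, batch, input_size)
out, (h_n, c_n) = lstm(x_seq)   # out contains the hidden state at every step
print(out.shape, h_n.shape)     # torch.Size([6, 1, 4]) torch.Size([1, 1, 4])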

PyTorch implementation

import torch
import torch.nn as nn
import torch.optim as optim
 
# Set seed for reproducibility
torch.manual_seed(42)
 
# Hyperparameters
input_size = 5    # dimension of x_t
hidden_size = 4   # dimension of h_t
output_size = 2   # dimension of y_t
seq_len = 6       # sequence length
batch_size = 1    
lr = 0.01
 
# Sample input and target
x_seq = torch.randn(seq_len, batch_size, input_size)  # shape: (T, B, D)
target_seq = torch.randint(0, output_size, (seq_len, batch_size))  # classification targets
 
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.Wxh = nn.Linear(input_size, hidden_size)   # W_xh: input-to-hidden
        self.Whh = nn.Linear(hidden_size, hidden_size)  # W_hh: hidden-to-hidden
        self.Why = nn.Linear(hidden_size, output_size)  # W_hy: hidden-to-output
        self.tanh = torch.tanh
 
    def forward(self, x_seq):
        T, B, _ = x_seq.size()
        h_t = torch.zeros(B, self.hidden_size, device=x_seq.device)  # initialize hidden state
        outputs = []
 
        for t in range(T):
            x_t = x_seq[t]
            h_t = self.tanh(self.Wxh(x_t) + self.Whh(h_t))
            y_t = self.Why(h_t)
            outputs.append(y_t)
 
        return torch.stack(outputs)  # shape: (T, B, output_size)
 
# Instantiate model and optimizer
model = RNN(input_size, hidden_size, output_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
 
# Training step
model.train()
optimizer.zero_grad()
 
# Forward pass
y_preds = model(x_seq)  # shape: (T, B, output_size)
 
# Compute loss
loss = 0
for t in range(seq_len):
    loss += criterion(y_preds[t], target_seq[t])
 
# Backward pass (BPTT)
loss.backward()
 
# Gradient clipping (optional)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
 
# Parameter update
optimizer.step()
 
print(f"Loss: {loss.item():.4f}")