Summary

LSTM (Long Short-Term Memory) is a type of RNN designed to overcome the vanishing gradient problem and remember information over longer sequences.

  • Input vector $x_t$: the data point at the current time step (e.g., a word embedding)
  • Hidden state $h_t$: the output at time $t$, passed to the next time step and possibly the next layer
  • Cell state $C_t$: internal memory that carries forward important information through time

The LSTM uses gates to control information flow — allowing it to selectively “remember” or “forget” things.
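
To make these pieces concrete, here is a minimal sketch using PyTorch's built-in nn.LSTMCell just to inspect the shapes of $x_t$, $h_t$, and $C_t$ (the sizes are arbitrary example values):

import torch
import torch.nn as nn

# Arbitrary example sizes: batch of 4, embeddings of size 10, hidden/cell state of size 20
cell = nn.LSTMCell(input_size=10, hidden_size=20)

x_t = torch.randn(4, 10)       # input vector at the current time step
h_prev = torch.zeros(4, 20)    # previous hidden state
C_prev = torch.zeros(4, 20)    # previous cell state

h_t, C_t = cell(x_t, (h_prev, C_prev))
print(h_t.shape, C_t.shape)    # torch.Size([4, 20]) torch.Size([4, 20])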

Gates in an LSTM Cell

At each time step $t$, the LSTM processes:

  • the current input $x_t$
  • the previous hidden state $h_{t-1}$
  • the previous cell state $C_{t-1}$

Forget Gate

What does it do?

The forget gate decides what information to erase from the previous cell state:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

  • Outputs values in $[0, 1]$
  • 0 = forget completely, 1 = retain fully (see the small numeric sketch below)
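
A tiny numeric sketch of the forget gate's effect, with hand-picked gate values rather than learned ones:

import torch

C_prev = torch.tensor([2.0, -1.0, 0.5])   # previous cell state
f_t = torch.tensor([0.0, 0.5, 1.0])       # hand-picked forget gate values in [0, 1]

print(f_t * C_prev)  # -> [0.0, -0.5, 0.5]: forgotten, halved, kept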

Input Gate

What does it do?

The input gate decides what new information should be stored in the memory cell:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$

Compute the candidate memory values:

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
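
A tiny numeric sketch of the input gate scaling the candidate values (hand-picked numbers, not learned ones):

import torch

i_t = torch.tensor([0.9, 0.1])        # input gate: how much of each candidate to admit
g_t = torch.tensor([0.7, -0.8])       # candidate memory values in (-1, 1) from tanh

print(i_t * g_t)  # -> [0.63, -0.08]: the contribution that will be added to the cell state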


Update the Cell State

Update the internal memory:

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$

  • Forget some of the past: $f_t \odot C_{t-1}$
  • Add some of the new: $i_t \odot \tilde{C}_t$ (both terms are combined in the sketch below)
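
A small numeric sketch of the update, reusing the hand-picked input gate and candidate values from above:

import torch

C_prev = torch.tensor([2.0, -1.0])    # previous cell state
f_t = torch.tensor([0.5, 1.0])        # forget gate
i_t = torch.tensor([0.9, 0.1])        # input gate
g_t = torch.tensor([0.7, -0.8])       # candidate memory

C_t = f_t * C_prev + i_t * g_t        # forget some of the past, add some of the new
print(C_t)  # -> [1.63, -1.08]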

📤 Output Gate

What does it do?

The output gate decides what part of the memory to pass forward as the hidden state:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$

Final hidden state output:

$h_t = o_t \odot \tanh(C_t)$
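
One last hand-picked sketch; the full implementation below wires all four gates together with learned weights:

import torch

C_t = torch.tensor([1.63, -1.08])     # updated cell state from the previous step
o_t = torch.tensor([1.0, 0.2])        # output gate: how much of the memory to expose

h_t = o_t * torch.tanh(C_t)           # hidden state passed to the next time step
print(h_t)  # -> approximately [0.93, -0.16]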

PyTorch Implementation

import torch
import torch.nn as nn
 
class CustomLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
 
        # Combine all gates into one matrix for efficiency:
        # [forget_gate | input_gate | candidate_memory | output_gate]
        self.linear = nn.Linear(input_size + hidden_size, 4 * hidden_size)
 
    def forward(self, x_t, h_prev, C_prev):
        # Concatenate previous hidden state and current input
        combined = torch.cat((h_prev, x_t), dim=1)  # shape: [batch_size, input_size + hidden_size]
 
        # Apply linear transformation to get all gate pre-activations
        gates = self.linear(combined)  # shape: [batch_size, 4 * hidden_size]
        f_t, i_t, g_t, o_t = torch.chunk(gates, chunks=4, dim=1)
 
        # Apply activations
        f_t = torch.sigmoid(f_t)       # forget gate
        i_t = torch.sigmoid(i_t)       # input gate
        g_t = torch.tanh(g_t)          # candidate memory
        o_t = torch.sigmoid(o_t)       # output gate
 
        # Update cell state
        C_t = f_t * C_prev + i_t * g_t
 
        # Compute new hidden state
        h_t = o_t * torch.tanh(C_t)
 
        return h_t, C_t
 
class CustomLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = CustomLSTMCell(input_size, hidden_size)
 
    def forward(self, x):
        # x: shape [seq_len, batch_size, input_size]
        seq_len, batch_size, _ = x.size()
        # Initialize hidden and cell states to zeros on the same device/dtype as the input
        h_t = torch.zeros(batch_size, self.cell.hidden_size, device=x.device, dtype=x.dtype)
        C_t = torch.zeros(batch_size, self.cell.hidden_size, device=x.device, dtype=x.dtype)
 
        outputs = []
 
        for t in range(seq_len):
            x_t = x[t]
            h_t, C_t = self.cell(x_t, h_t, C_t)
            outputs.append(h_t)
 
        return torch.stack(outputs)  # shape: [seq_len, batch_size, hidden_size]
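
A quick usage sketch with arbitrary example dimensions, checking the output shape and comparing it against PyTorch's built-in nn.LSTM:

# Example: sequence of 5 time steps, batch of 3, embeddings of size 10, hidden size 20
lstm = CustomLSTM(input_size=10, hidden_size=20)
x = torch.randn(5, 3, 10)           # [seq_len, batch_size, input_size]

out = lstm(x)
print(out.shape)                    # torch.Size([5, 3, 20])

# Same input through the built-in nn.LSTM (different weights, same output shape)
builtin = nn.LSTM(input_size=10, hidden_size=20)
out_builtin, (h_n, c_n) = builtin(x)
print(out_builtin.shape)            # torch.Size([5, 3, 20])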

References

Understanding LSTM Networks — colah’s blog