Masked self-attention

  • Masked self-attention is a variant of self-attention used in decoder-only transformers.
  • The tokens that follow a given token in the sequence are “masked out”: no attention score is computed between the given token and any token that comes after it.
    • This prevents the model from looking forward in the sequence during self-attention (see the sketch after this list).
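
As a concrete illustration, below is a minimal single-head sketch of masked (causal) self-attention in NumPy. The function name and the projection matrices `W_q`, `W_k`, `W_v` are illustrative assumptions, not taken from the text; real decoder-only transformers use multi-head attention and framework-specific implementations.

```python
import numpy as np

def masked_self_attention(X, W_q, W_k, W_v):
    """Single-head masked (causal) self-attention — an illustrative sketch.

    X: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_head) projection matrices (assumed names)
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)              # (seq_len, seq_len)

    # Causal mask: position i may only attend to positions j <= i,
    # so entries above the diagonal (future tokens) are masked out.
    seq_len = X.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Softmax over the key dimension; masked (-inf) entries get zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Because each masked score is set to negative infinity before the softmax, the corresponding attention weight is exactly zero, so a token's output depends only on itself and earlier tokens.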