Masked self-attention
- Masked self-attention is a variant of self-attention used in decoder-only transformers.
- The tokens that follow a given token within the sequence are “masked out”: their attention scores are set to negative infinity before the softmax, so the resulting attention weights are zero.
- This prevents looking forward in the sequence during self-attention, so each token can only attend to itself and earlier tokens (see the sketch below).
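A minimal NumPy sketch of single-head masked (causal) self-attention, assuming a sequence of shape `(seq_len, d)`; the function name and toy inputs are illustrative, not from any particular library:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Single-head causal self-attention over a (seq_len, d) sequence."""
    seq_len, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (seq_len, seq_len)

    # Causal mask: True strictly above the diagonal, i.e. "future" positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

    # Masked scores become -inf, so their softmax weights are exactly zero.
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax (numerically stabilized by subtracting the row max).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (seq_len, d)

# Toy usage: 4 tokens with 8-dimensional embeddings, Q = K = V for brevity.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = masked_self_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Because row `i` of the mask zeroes out columns `i+1, …, seq_len-1`, each output position is a weighted average of values at its own and earlier positions only, which is what lets a decoder-only transformer be trained for next-token prediction.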