Masked self-attention

  • Masked self-attention is a variant of self-attention used in decoder-only transformers.
  • The tokens that follow a given token in the sequence are “masked out”: no attention score is computed between the given token and any token that comes after it.
    • This prevents the model from looking forward in the sequence during self-attention (see the sketch after this list).
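
As a concrete illustration, below is a minimal single-head sketch of masked (causal) self-attention in NumPy. The function name and the projection matrices `W_q`, `W_k`, `W_v` are illustrative assumptions, not taken from the text; real decoder-only transformers use multi-head attention and framework-specific implementations.

```python
import numpy as np

def masked_self_attention(X, W_q, W_k, W_v):
    """Single-head masked (causal) self-attention — an illustrative sketch.

    X: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_head) projection matrices (assumed names)
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)              # (seq_len, seq_len)

    # Causal mask: position i may only attend to positions j <= i,
    # so entries above the diagonal (future tokens) are masked out.
    seq_len = X.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Softmax over the key dimension; masked (-inf) entries get zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Because each masked score is set to negative infinity before the softmax, the corresponding attention weight is exactly zero, so a token's output depends only on itself and earlier tokens.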