Andrej Karpathy has an incredible video on building GPT from scratch.
Here are cartoon notes covering the contents, with rough code sketches after the list:
- tokenizer
- characters or sequences of characters become “tokens”. Many strategies for this; in this video each character is simply its own token.
- block size
- how many tokens of context are used to predict the next token?
- batch size
- how many blocks are processed in parallel, to maximize speed/GPU utilization
- BigramModel as model template
- every LLM needs…
def forward -> (logits, loss)
def generate
- tril
- “triangular”: tril gives a lower-triangular matrix (torch.tril)
- token prediction should only look at past tokens
- tril is used as a mask to hide future tokens from the attention scores
- positional encoding
- add parameterized info about positions.
- Attention (“the most important part”)
- single attention:
- QKV
- more like: L(Q), L(K), L(V), where each L is a separate Linear layer
- more like: softmax(tril(scale(L(Q) @ L(K)ᵀ))) @ L(V)
- multi head attention
- Concat([Head 1, Head 2, … Head n]), then a Linear projection back to the embedding size
- attention block
- Multi head attention -> Feed Forward
- sequence of blocks
- “Sequential([Block, Block, Block…])”
- regularization techniques
- skip connections
- layer norm
- dropout
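Below are rough PyTorch sketches of the pieces above, in roughly the order they appear in the video. Names, hyperparameters, and the `input.txt` path are my own assumptions, not necessarily exactly what the video uses.

First, the character-level tokenizer: every unique character in the training text gets an integer id.

```python
# Character-level "tokenizer": each unique character becomes a token id.
text = open("input.txt").read()  # assumes a plain-text training file

chars = sorted(set(text))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

sample = text[:20]
assert decode(encode(sample)) == sample  # round-trips
```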
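Block size and batch size, continuing from the tokenizer sketch (reuses `encode` and `text`):

```python
import torch

block_size = 8   # how many tokens of context are used to predict the next token
batch_size = 32  # how many blocks are processed in parallel

data = torch.tensor(encode(text), dtype=torch.long)

def get_batch():
    # pick batch_size random starting positions and slice out blocks
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets, shifted by one
    return x, y
```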
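The BigramModel as the model template: whatever the model grows into, it keeps a `forward` that returns `(logits, loss)` and a `generate` that samples new tokens. A minimal sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        self.token_embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)         # last time step only
            idx_next = torch.multinomial(probs, num_samples=1)  # sample one token
            idx = torch.cat([idx, idx_next], dim=1)
        return idx
```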
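The tril trick in isolation: a lower-triangular mask keeps each position from looking at future tokens (a small illustration with made-up scores):

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)           # raw attention scores, one row per position
tril = torch.tril(torch.ones(T, T))  # 1s on and below the diagonal
scores = scores.masked_fill(tril == 0, float("-inf"))  # hide future tokens
weights = F.softmax(scores, dim=-1)  # each row now sums to 1 over past tokens only
```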
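A single attention head, putting QKV, scaling, and the tril mask together (class and argument names are mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of masked self-attention: softmax(tril(scale(Q @ K^T))) @ V."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5             # scale
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # tril mask
        weights = F.softmax(scores, dim=-1)
        return weights @ v
```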
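Multi-head attention: several heads run in parallel, their outputs are concatenated, then projected back to the embedding size (builds on the `Head` sketch above):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(n_head)]
        )
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # Concat([Head 1, ..., Head n])
        return self.proj(out)
```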
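The attention block, which is also where the regularization pieces land: layer norm before each sub-layer, a skip connection around each, and dropout inside the feed-forward. (Putting dropout only in the feed-forward is a simplification of what the video does.)

```python
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, n_embd, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),  # dropout
        )

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    """Multi-head attention -> feed forward, wrapped in layer norm + skip connections."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, n_head, block_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # skip connection around attention
        x = x + self.ffwd(self.ln2(x))  # skip connection around feed forward
        return x
```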
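Finally, the positional encoding and the sequence of blocks: learned position embeddings are added to token embeddings, the result runs through `Sequential([Block, Block, Block…])`, and a final linear layer maps back to vocabulary logits. `TinyGPT` is my own name for this sketch; `generate` would follow the same pattern as in the BigramModel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)  # learned positions
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_head, block_size) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)  # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok = self.token_embedding(idx)                                    # (B, T, n_embd)
        pos = self.position_embedding(torch.arange(T, device=idx.device))  # (T, n_embd)
        x = self.blocks(tok + pos)           # sequence of attention blocks
        logits = self.lm_head(self.ln_f(x))  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss
```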