GPT from scratch: Karpathy

https://youtu.be/kCc8FmEb1nY

Andrej Karpathy has an incredible video on building GPT from scratch.

Here are cartoon notes covering the contents (rough code sketches for the main pieces follow after the list):

  • tokenizer
    • sequences of characters get mapped to “tokens”. There are many strategies for this; in the video each character is simply its own token (see the tokenizer sketch below).
  • block size
    • the context length: how many previous tokens the model gets to see when predicting the next one
  • batch size
    • how many blocks run in parallel to max speed/GPU usage
  • BigramModel as model template
    • every LLM here needs these two methods:
    • def forward -> (logits, loss)
    • def generate
  • tril
    • short for “lower triangular”: torch.tril keeps only the lower-triangular part of a matrix
    • token prediction should only look at past tokens
    • a tril mask knocks out the attention weights on future tokens (they are set to -inf before the softmax)
  • positional encoding
    • add learned (parameterized) information about each token’s position; in the video this is a position embedding table added to the token embeddings.
  • Attention (“the most important part”)
    • single attention:
      • QKV
      • more precisely: Q, K, and V each come from a Linear layer applied to the input
      • output = softmax(tril(scale(Q·Kᵀ))) · V (the scores use Q and K, not V; see the single-head sketch below)
    • multi head attention
      • Concat([Head 1, Head 2, … Head n]), followed by a Linear projection
    • attention block
      • Multi head attention -> Feed Forward
    • sequence of blocks
    • Sequential([Block, Block, Block, …])
  • optimization & regularization techniques
    • skip connections
    • layer norm
    • dropout
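
The sketches below are rough PyTorch reconstructions of the pieces above, in the order of the notes. They follow the spirit of the video, but the names, sizes, and helper code are my own assumptions rather than the video's exact code.

Character-level tokenizer: each distinct character in the corpus gets an integer id (`text` is assumed to already hold the training text).

```python
# Assumes `text` already holds the training corpus as one big string.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> int
itos = {i: ch for i, ch in enumerate(chars)}   # int -> char

encode = lambda s: [stoi[c] for c in s]              # string -> list of token ids
decode = lambda ids: ''.join(itos[i] for i in ids)   # list of token ids -> string

assert decode(encode(text[:10])) == text[:10]        # round-trips
```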
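
Block size and batch size show up together in the batching helper; this `get_batch` is an assumed reconstruction, where `data` is a 1-D tensor of token ids.

```python
import torch

block_size = 8    # context length: tokens used for each prediction
batch_size = 32   # how many independent blocks are processed in parallel

def get_batch(data):
    # Sample batch_size random starting points and cut out blocks of tokens.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs  (B, T)
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets (B, T), shifted by one
    return x, y
```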
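
The bigram model is the template: `forward` returns `(logits, loss)` and `generate` samples tokens autoregressively. A minimal sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    # Simplest possible model: each token directly looks up the logits
    # for the next token in an embedding table.
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx: (B, T) tensor of token ids
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # Repeatedly sample the next token and append it to the context.
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)      # last time step only
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, idx_next], dim=1)
        return idx
```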
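
The tril trick on its own: a lower-triangular mask makes each position attend only to itself and earlier positions.

```python
import torch
import torch.nn.functional as F

T = 4
tril = torch.tril(torch.ones(T, T))              # 1s on and below the diagonal

wei = torch.zeros(T, T)                          # raw attention scores (all equal here)
wei = wei.masked_fill(tril == 0, float('-inf'))  # future positions get -inf ...
wei = F.softmax(wei, dim=-1)                     # ... so they get zero weight after softmax
print(wei)   # row t spreads its weight uniformly over positions 0..t
```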
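
Positional encoding in the video is a learned embedding table indexed by position, added to the token embeddings; the sizes below are assumed.

```python
import torch
import torch.nn as nn

vocab_size, n_embd, block_size = 65, 32, 8   # assumed sizes

token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)

idx = torch.randint(vocab_size, (4, block_size))               # (B, T) batch of token ids
tok_emb = token_embedding_table(idx)                           # (B, T, n_embd): what each token is
pos_emb = position_embedding_table(torch.arange(block_size))   # (T, n_embd):    where it sits
x = tok_emb + pos_emb    # broadcast add: every sequence in the batch gets position info
```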
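
A single masked self-attention head, putting QKV, scaling, tril, and softmax together (a sketch with assumed hyperparameter names):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    # One head of masked ("causal") self-attention.
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        v = self.value(x)  # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5           # scaled scores (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # no peeking at the future
        wei = F.softmax(wei, dim=-1)
        return wei @ v     # (B, T, head_size)
```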
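
Multi-head attention runs several such heads in parallel, concatenates their outputs, and projects back to the embedding size (uses the Head class from the previous sketch):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, n_embd, num_heads, block_size):
        super().__init__()
        head_size = n_embd // num_heads
        self.heads = nn.ModuleList([Head(n_embd, head_size, block_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)   # mixes the concatenated head outputs

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_embd)
        return self.proj(out)
```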
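
An attention block chains multi-head attention and a feed-forward layer, each wrapped in a skip connection with a LayerNorm applied first, plus dropout; a stack of these is the Sequential([Block, Block, …]) from the notes. Again a sketch with assumed sizes, building on the MultiHeadAttention sketch above.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    # Simple per-token MLP that follows attention in each block.
    def __init__(self, n_embd, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),   # dropout as regularization
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    # Communication (attention) then computation (feed-forward),
    # each inside a skip connection with LayerNorm applied first.
    def __init__(self, n_embd, num_heads, block_size):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, num_heads, block_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # skip connection around attention
        x = x + self.ffwd(self.ln2(x))  # skip connection around feed-forward
        return x

# Stack them, e.g.:
# blocks = nn.Sequential(*[Block(n_embd=32, num_heads=4, block_size=8) for _ in range(3)])
```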