Links to the book:
amzn.to/4fqvn0D (Amazon)
mng.bz/M96o (Manning)
Link to the GitHub repository: github.com/rasbt/LLMs-from-scratch
This is a supplementary video going over the text data preparation steps (tokenization, byte pair encoding, data loaders, etc.) for LLM training. Brief code sketches of these steps appear below the chapter list.
00:00 2.2 Tokenizing text
14:02 2.3 Converting tokens into token IDs
23:56 2.4 Adding special context tokens
30:26 2.5 Byte pair encoding
44:00 2.6 Data sampling with a sliding window
1:07:10 2.7 Creating token embeddings
1:15:45 2.8 Encoding word positions
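
For reference, the tokenization pipeline from sections 2.2–2.5 can be sketched in a few lines of Python. This is a minimal sketch, not the book's exact code; it assumes the tiktoken package is installed and uses the GPT-2 BPE encoding with <|endoftext|> as the special context token.

# Minimal sketch of sections 2.2-2.5 (not the book's exact code):
# tokenize text, map tokens to IDs, handle a special token, and
# compare with the GPT-2 byte pair encoding via tiktoken.
import re

import tiktoken  # assumed installed: pip install tiktoken

text = "Hello, world. Is this-- a test?"

# 2.2 Tokenizing text: split on whitespace and punctuation
tokens = [t for t in re.split(r'([,.:;?_!"()\']|--|\s)', text) if t.strip()]

# 2.3 Converting tokens into token IDs via a vocabulary
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[t] for t in tokens]
print(ids)

# 2.4 Adding special context tokens, e.g. to mark document boundaries
vocab["<|endoftext|>"] = len(vocab)

# 2.5 Byte pair encoding: GPT-2's tokenizer falls back to subword
# units for unknown words, so no <|unk|> token is needed
bpe = tiktoken.get_encoding("gpt2")
bpe_ids = bpe.encode("Hello, world. <|endoftext|> Akwirw ier",
                     allowed_special={"<|endoftext|>"})
print(bpe_ids)
print(bpe.decode(bpe_ids))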
You can find additional bonus materials on GitHub:
Byte Pair Encoding (BPE) Tokenizer From Scratch, github.com/rasbt/LLMs-from-scratch/blob/main/ch02/…
Comparing Various Byte Pair Encoding (BPE) Implementations, github.com/rasbt/LLMs-from-scratch/blob/main/ch02/…
Understanding the Difference Between Embedding Layers and Linear Layers, github.com/rasbt/LLMs-from-scratch/blob/main/ch02/…
Data sampling with a sliding window with number data, github.com/rasbt/LLMs-from-scratch/blob/main/ch02/…
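
As a companion to sections 2.6–2.8, here is a minimal PyTorch sketch of data sampling with a sliding window plus token and positional embeddings. It is an illustrative approximation, not the book's exact code; the class name SlidingWindowDataset is hypothetical, the max_length and stride parameters follow the book's conventions, and the vocabulary size 50257 matches the GPT-2 BPE tokenizer.

# Minimal sketch of sections 2.6-2.8 (illustrative, not the book's exact code).
import tiktoken
import torch
from torch.utils.data import DataLoader, Dataset


class SlidingWindowDataset(Dataset):  # hypothetical name for illustration
    def __init__(self, text, tokenizer, max_length=4, stride=1):
        ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        # 2.6 Sliding window: targets are the inputs shifted by one token
        for i in range(0, len(ids) - max_length, stride):
            self.inputs.append(torch.tensor(ids[i:i + max_length]))
            self.targets.append(torch.tensor(ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]


tokenizer = tiktoken.get_encoding("gpt2")
dataset = SlidingWindowDataset("In the heart of the city stood an old library.",
                               tokenizer, max_length=4, stride=4)
loader = DataLoader(dataset, batch_size=2, shuffle=False)
inputs, targets = next(iter(loader))

# 2.7 Creating token embeddings: a lookup table of learnable vectors
vocab_size, emb_dim = 50257, 256
token_emb = torch.nn.Embedding(vocab_size, emb_dim)

# 2.8 Encoding word positions: one learnable vector per position,
# added to the token embeddings so order information is preserved
max_length = 4
pos_emb = torch.nn.Embedding(max_length, emb_dim)

x = token_emb(inputs) + pos_emb(torch.arange(max_length))
print(x.shape)  # torch.Size([2, 4, 256])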