06:43 I'd say this matrix is there to keep the dimensions the same between encoder layers: the input to each layer is seq_length × embedding_dim (if we ignore the batch), but the output of multi-head attention is seq_length × (n_heads · embedding_dim). This matrix also obviously learns to extract some information from the multi-head attention output, as you mentioned.
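A minimal sketch of that output projection, under the commenter's assumption that each head's output has the full embedding size (the names embed_dim, n_heads, and the shapes are just illustrative):

```python
import torch
import torch.nn as nn

seq_len, embed_dim, n_heads = 6, 16, 4

# Concatenated multi-head output: (seq_len, n_heads * embed_dim)
concat_heads = torch.randn(seq_len, n_heads * embed_dim)

# A learnable output projection maps it back to (seq_len, embed_dim),
# so the next encoder layer sees the same shape as its own input.
W_o = nn.Linear(n_heads * embed_dim, embed_dim, bias=False)
out = W_o(concat_heads)

print(out.shape)  # torch.Size([6, 16])
```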
I believe this is the best explanation of MSA .... Please keep teaching us
Thanks for the tutorial.
I'm a little confused: is this multi-head attention or multi-head self-attention? It seems like MSA to me. If anyone can help me clear this up, I'd be grateful.
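For what it's worth, a minimal sketch of the usual distinction: "self-attention" just means the queries, keys, and values are all projected from the same sequence, so multi-head self-attention is the special case of multi-head attention where all three inputs coincide (the tensors below are purely illustrative):

```python
import torch
import torch.nn as nn

embed_dim, n_heads = 16, 4
mha = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

x = torch.randn(1, 6, embed_dim)       # one sequence of 6 tokens
memory = torch.randn(1, 9, embed_dim)  # e.g. encoder outputs in a decoder

# Multi-head SELF-attention: query, key, and value are the same sequence.
self_out, _ = mha(x, x, x)

# Multi-head CROSS-attention: queries from x, keys/values from memory.
cross_out, _ = mha(x, memory, memory)

print(self_out.shape, cross_out.shape)  # both (1, 6, 16)
```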
You legend!
I can't understand how the different heads would be able to focus on different aspects of a sequence if the starting point is the same embedding vector, and hence the dot products come from the same weight matrix.
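A minimal sketch of one common resolution, assuming the standard setup where each head gets its own independently initialized query/key/value projections: even with identical input embeddings, the heads compute different attention scores and can specialize during training (the dimensions here are just illustrative):

```python
import torch

torch.manual_seed(0)
seq_len, embed_dim, head_dim, n_heads = 6, 16, 8, 2

x = torch.randn(seq_len, embed_dim)  # the same embeddings feed every head

scores = []
for _ in range(n_heads):
    # Each head has its own independently initialized W_q and W_k.
    W_q = torch.randn(embed_dim, head_dim)
    W_k = torch.randn(embed_dim, head_dim)
    q, k = x @ W_q, x @ W_k
    scores.append(torch.softmax(q @ k.T / head_dim ** 0.5, dim=-1))

# Different weights -> different attention patterns over the same input.
print(torch.allclose(scores[0], scores[1]))  # False
```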
Hello Dr. Raschka, thanks for the nice video and explanation. What exactly is the purpose of using multiple heads instead of only one? Has it been shown (empirically or theoretically) in any research that the outputs of different heads learn different semantic concepts?
Where can I find the entire series, please?
Thanks a lot!!!
liked and subscribed
Quick question: am I right in saying that each encoder layer has 3h + 3 matrices to learn, where h is the number of heads and the other 3 are the linear layers in the encoder?
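A rough sketch of that count, assuming per-head Q/K/V projections (3h) plus the output projection and the two feed-forward linear layers as the extra 3, and ignoring biases and layer-norm parameters; this reflects one common encoder-layer layout, not necessarily the tutorial's exact code:

```python
# Counting weight matrices in one simplified Transformer encoder layer,
# ignoring biases and layer normalization.
def encoder_layer_matrices(n_heads: int) -> int:
    per_head = 3 * n_heads  # W_q, W_k, W_v for each head
    output_proj = 1         # projects concatenated heads back to embed_dim
    feed_forward = 2        # the two linear layers of the FFN block
    return per_head + output_proj + feed_forward

print(encoder_layer_matrices(8))  # 27 = 3*8 + 3
```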
Where is the positional embedding concept covered?
So "multi-head" is just another way of saying multiple sets of weights... god, I hate English.