@informomp

The clarity of the explanation is often proportional to the depth of the understanding. Awesome

@КонстантинДемьянов-л2п

06:43 I'd say this matrix keeps the dimensions the same between different encoder layers, since the input to each layer is seq_length * embedding (if we ignore the batch), but the output of multi-head attention is seq_length * (n_heads * embedding). Also, this matrix obviously learns to extract some information from the multi-head attention output, as you mentioned.
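
(A minimal PyTorch sketch of this point, not from the video; the sizes are hypothetical, and each head is assumed to output head_dim values, so the concatenated heads have width n_heads * head_dim before the projection:)

import torch

seq_len, embed_dim, n_heads, head_dim = 6, 32, 4, 32  # hypothetical sizes

# stand-in for the h head outputs, already concatenated along the last axis
concat_heads = torch.randn(seq_len, n_heads * head_dim)

# the learned output projection discussed above
W_o = torch.nn.Linear(n_heads * head_dim, embed_dim, bias=False)

out = W_o(concat_heads)
print(out.shape)  # torch.Size([6, 32]) -- back to seq_len x embedding, matching the layer input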

@AishKhan-le7xq

I believe this is the best explanation of MSA... Please keep teaching us

@Mewgu_studio

Thanks for the tutorial.

@Itskfx

I'm a little confused: is this multi-head attention or multi-head self-attention? Because it seems like MSA to me.
If anyone can help me clear this up, I'd be grateful.

@timverbarg

You legend

@parmanand3956

I can't understand how the different heads would be able to focus on different aspects of a sequence if the starting point is the same embedding vector, and hence the same weight matrix in the dot product.
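
(A minimal PyTorch sketch, not from the video, of why the heads can still diverge: each head has its own learned W_q/W_k/W_v, so the same input embeddings produce different attention patterns. Sizes and initialization are hypothetical:)

import torch

torch.manual_seed(0)
seq_len, embed_dim, head_dim = 5, 16, 8  # hypothetical sizes

x = torch.randn(seq_len, embed_dim)  # the same input embeddings for every head

def attention_weights(x, d):
    # each call creates its own W_q and W_k, standing in for
    # the separately learned projection matrices of one head
    W_q = torch.nn.Linear(embed_dim, d, bias=False)
    W_k = torch.nn.Linear(embed_dim, d, bias=False)
    scores = W_q(x) @ W_k(x).T / d**0.5
    return torch.softmax(scores, dim=-1)

head_1 = attention_weights(x, head_dim)
head_2 = attention_weights(x, head_dim)
print(torch.allclose(head_1, head_2))  # False: same input, different attention patterns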

@vandanaschannel4000

Hello Dr. Raschka, thanks for the nice video and explanation. What exactly is the purpose of using multiple heads instead of only one? Is it shown (empirically or theoretically) in any research work that the outputs of different heads learn different semantic concepts?

@Divyaa1100

Where can I find the entire series, please?

@orientzdf5096

Thanks a lot!!!

@orientzdf5096

liked and subscribed

@billykotsos4642

Quick question... am I right in saying that each encoder layer has 3h + 3 matrices to learn, where h is the number of heads and the other 3 are the linear layers in the encoder?
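
(Rough sanity check, not from the video: the 3h + 3 count matches if you count a separate W_q, W_k, W_v per head plus the attention output projection and the two feed-forward linear layers, ignoring biases, layer norms, and embeddings. Implementations often fuse the per-head matrices, as this hypothetical PyTorch check shows:)

import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, dim_feedforward=64)

# list only the 2-D weight matrices, skipping biases and LayerNorm parameters
for name, p in layer.named_parameters():
    if p.dim() == 2:
        print(name, tuple(p.shape))
# self_attn.in_proj_weight (96, 32)   <- W_q, W_k, W_v for all heads fused into one matrix
# self_attn.out_proj.weight (32, 32)  <- attention output projection
# linear1.weight (64, 32)             <- feed-forward layer 1
# linear2.weight (32, 64)             <- feed-forward layer 2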

@subashchandrapakhrin3537

Where is the positional embedding concept covered?

@rikki146

So "multi-head" is just another way of saying multiple sets of weights... god I hate English