ThreadSky
tyrellturing.bsky.social
•
115 days ago
Yes, agreed, if you have MLPs between each layer of self-attention it may be superfluous...
Comments
joshdeleeuw.bsky.social
•
115 days ago
You could have scenarios where the values and keys are different vectors (incl. different sizes) coming from different sources. This isn't the common use case, but the general mechanism allows it.
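To make that concrete, here is a minimal sketch of scaled dot-product attention in plain Python where the keys and values have different sizes. The function names (`softmax`, `attention`) and the toy data are hypothetical, chosen only for illustration. The one hard constraint is that queries and keys share a size and that there is one value per key.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists.

    queries: n_q vectors of size d_k
    keys:    n_kv vectors of size d_k (must match the query size)
    values:  n_kv vectors of size d_v (d_v may differ from d_k)
    """
    if len(keys) != len(values):
        raise ValueError("need exactly one value per key")
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Query-key similarity only involves the keys, never the values.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the values, so it has size d_v.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(d_v)]
        outputs.append(out)
    return outputs

# Hypothetical toy data: keys of size 2 from one source,
# values of size 3 from another.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
result = attention(queries, keys, values)
```

The output vector takes its size from the values (3 here), not from the keys (2), which is why the two can come from different places.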