See e.g. https://arxiv.org/abs/2410.18613, which recently showed that softmax attention can be replaced with alternatives that do not satisfy the properties we intuitively assign to it — and yet the resulting models seem to work just as well! (2/2)
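To make the contrast concrete, here is a minimal sketch of standard softmax attention next to one generic non-softmax variant (elementwise sigmoid scores). This is an illustrative example of the kind of substitution being discussed, not necessarily the specific construction in the linked paper; `softmax_attention` and `sigmoid_attention` are names chosen here for clarity.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard scaled dot-product attention: each row of the score
    # matrix is normalized into a probability distribution via softmax.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def sigmoid_attention(Q, K, V):
    # A non-softmax alternative: elementwise sigmoid on the scores.
    # Rows no longer sum to 1, so the weights are not a probability
    # distribution -- one of the "intuitive" properties being dropped.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = 1.0 / (1.0 + np.exp(-scores))
    return weights @ V
```

Both variants map the same (Q, K, V) inputs to an output of the same shape; the only difference is whether the attention weights form a convex combination of the values.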