The "Self-Extend" paper http://arxiv.org/abs/2401.01325 promises magic for your LLMs: extending the context window beyond what they were trained on. You can take an LLM trained on 2000 token sequences, feed it 5000 tokens and expect it to work. Thread 🧵
(SWA below = sliding window attention)
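How? Roughly: keep exact relative positions for tokens inside a local "neighbor" window, and for everything farther away floor-divide the position indices by a group size, so distant tokens reuse relative positions the model already saw in training. A minimal Python sketch of that remapping, as I read the paper (`group_size=4` and `window=512` are illustrative defaults, not the paper's settings):

```python
def self_extend_rel_pos(i: int, j: int, group_size: int = 4, window: int = 512) -> int:
    """Relative position fed to the positional encoding (e.g. RoPE) when
    query i attends to key j (i >= j). group_size/window are illustrative.

    Inside the neighbor window: exact distance (normal attention).
    Beyond it: floor-divided ("grouped") positions, with the query side
    shifted so the mapping stays continuous at the window boundary.
    """
    if i - j < window:
        return i - j                                  # neighbor attention
    q_pos = i // group_size + window - window // group_size  # shifted query
    k_pos = j // group_size
    return q_pos - k_pos                              # grouped attention
```

Back-of-envelope with those numbers: a model trained on 2,000-token sequences never sees a remapped distance beyond its training range until about 4 × (2000 − 512) + 512 ≈ 6,400 tokens, which comfortably covers the 5,000-token example above.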
(scaling + normalization ignored for simplicity)
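To make the merge concrete, here is a toy single-head sketch: compute scores twice, once with exact positions (the SWA/neighbor part) and once with floored positions (the grouped part), then keep the exact scores inside the window and the grouped scores beyond it. Toy RoPE and illustrative parameters, not the paper's code; causal mask, 1/√d scaling, and softmax omitted, per the note above:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Toy RoPE: rotate each (even, odd) dim pair by a position-dependent angle."""
    d = x.shape[-1]
    assert d % 2 == 0
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # (d/2,)
    ang = positions[:, None] * inv_freq[None, :]      # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def self_extend_scores(q, k, group_size=4, window=8):
    """Merged Self-Extend attention scores for one head; q, k: (n, head_dim).
    Causal mask, 1/sqrt(d) scaling, and softmax omitted for simplicity."""
    n = q.shape[0]
    pos = np.arange(n)
    # neighbor stream: exact positions, as in plain (sliding-window) attention
    s_neighbor = rope(q, pos) @ rope(k, pos).T
    # grouped stream: floored positions; query side shifted so distances
    # continue seamlessly where the neighbor window ends
    q_pos = pos // group_size + window - window // group_size
    k_pos = pos // group_size
    s_grouped = rope(q, q_pos) @ rope(k, k_pos).T
    # merge: exact scores inside the window, grouped scores beyond it
    dist = pos[:, None] - pos[None, :]
    return np.where(dist < window, s_neighbor, s_grouped)

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
k = rng.normal(size=(16, 8))
print(self_extend_scores(q, k, group_size=2, window=4).shape)  # (16, 16)
```

No fine-tuning anywhere in this: the two streams reuse the same pretrained weights, and only the position indices change.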