To try: have an autoregressive model predict the number of tokens it needs before it starts generating. One problem is that once you commit to generating N tokens, you don’t know the gradients for the other candidate lengths in 1…Max_N, some of which might be better choices.

We could run every length from 1 to Max_N and take the one with the minimum gradient, but that seems expensive.
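
A minimal sketch of that brute-force option, to make the cost concrete. Everything here is an assumption for illustration: `score_for_length` is a hypothetical callback standing in for "generate n tokens and measure the quantity you want to minimize" (gradient norm, loss, or whatever the criterion ends up being), and `toy_score` is a made-up function, not a real model.

```python
from typing import Callable, List, Tuple


def best_length_by_exhaustive_search(
    score_for_length: Callable[[int], float],
    max_n: int,
) -> Tuple[int, List[float]]:
    """Score every candidate length 1..max_n and return the lowest-scoring one.

    score_for_length is a hypothetical stand-in for one full generation run
    at a fixed length, followed by measuring the quantity to minimize.
    """
    # One full run per candidate length: this loop is the expensive part.
    scores = [score_for_length(n) for n in range(1, max_n + 1)]
    # Pick the length whose score is smallest (ties go to the shorter length).
    best_n = min(range(1, max_n + 1), key=lambda n: scores[n - 1])
    return best_n, scores


def toy_score(n: int) -> float:
    # Made-up score: a bowl centered on length 5 plus a small per-token cost.
    return (n - 5) ** 2 + 0.1 * n


best_n, scores = best_length_by_exhaustive_search(toy_score, max_n=8)
print(best_n, scores)
```

If the runs share no computation, scoring all lengths costs 1 + 2 + … + Max_N = Max_N(Max_N + 1)/2 generated tokens per example instead of N, which is where the expense comes from.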
