To try: have a autoregressive model predict number of tokens needed before generating. One problem is once you commit to generating N tokens you don’t know the gradients for 1..{N}…Max_N, which might be better choices.
We could run all Max_N lengths and take min gradient which seems expensive
We could run all Max_N lengths and take min gradient which seems expensive
Comments
Spatially adaptive Computation time
Depth adaptive transformer