ML/Linguistics question: Did anyone ever establish for sure why reversing the input sequence in an LSTM encoder-decoder improved results (the trick reported in Sutskever et al.'s 2014 seq2seq paper)? I've seen claims about it introducing short-term dependencies (which presumably depends on the word order of the language pair) and about it making the task harder for the decoder, but I can't find anything concrete.
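For readers unsure what the trick actually is: below is a minimal sketch of input reversal in an encoder-decoder LSTM, assuming a PyTorch-style model; all names and sizes (`ReversedSeq2Seq`, `SRC_VOCAB`, `HID`, etc.) are illustrative, not from any particular implementation. The point it illustrates is the short-term-dependency argument itself: after flipping, the *first* source word is the *last* thing the encoder reads, so it sits closest in time steps to the first target word the decoder must emit.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1000, 64, 128  # illustrative sizes

class ReversedSeq2Seq(nn.Module):
    """Encoder-decoder LSTM that reads the source sequence reversed.
    Reversing means the encoder's final hidden state is computed right
    after it sees the original *first* source word, i.e. the word most
    relevant to the decoder's first prediction."""
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, src_ids, tgt_ids):
        # The trick: flip the source along the time dimension.
        src_rev = torch.flip(src_ids, dims=[1])
        _, state = self.encoder(self.src_emb(src_rev))
        # Decoder is conditioned only on the encoder's final (h, c) state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)

# Toy usage: batch of 2 sentences, source length 5, target length 6.
model = ReversedSeq2Seq()
src = torch.randint(0, SRC_VOCAB, (2, 5))
tgt = torch.randint(0, TGT_VOCAB, (2, 6))
logits = model(src, tgt)  # shape: (2, 6, TGT_VOCAB)
```

One practical caveat with this sketch: in real training you would reverse before padding (or flip only the non-pad tokens per sequence), since naively flipping a padded batch moves the padding to the front of the encoder's input.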