some models can process video as video, no translation needed. others can only handle a handful of frames, along with the audio after it’s been transcoded.

i don’t know the technicals, but it’s really fucking obvious when you compare the results from the two.
