Yet more evidence that transfer learning of sequence-only PLMs does not benefit from scale beyond 650M params 🧡

Comments