More experiments and model updates to come. This model is severely under-trained, having seen only 32M samples (out of 300M possible) so far.
Reposted from Nathan Paull
Finally got around to completing the first major training runs of my own BERT-like language embedding model. There is a ton of data to pore over as I prepare my next experiment for this weekend, but early results show my model outperforming a Transformer++ BERT model by 1% with fewer parameters!