I decided to make the datasets I'm generating for my PhD project public. For each protein in Swiss-Prot, I'm providing PLM embeddings (ProtTrans, Ankh, ESM2), GO annotations, and taxonomy representations. All files list the proteins in the same order, one line per protein.
https://github.com/pentalpha/protein_dimension_db
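Since every file has one line per protein in the same order, line i of one file and line i of another describe the same protein. Here is a minimal Python sketch of reading them side by side. The file names (`ids.txt`, `go.tsv`), their contents, and the helper `iter_aligned` are made up for illustration; check the repository for the real file names and formats.

```python
import tempfile
from itertools import zip_longest
from pathlib import Path

_MISSING = object()  # sentinel used to detect files of different length


def iter_aligned(*paths):
    """Yield one tuple per protein, taking line i from every file.

    Hypothetical helper: it only relies on the files sharing the same
    line order, and raises if one file runs out before the others.
    """
    handles = [open(p, encoding="utf-8") for p in paths]
    try:
        for row in zip_longest(*handles, fillvalue=_MISSING):
            if any(value is _MISSING for value in row):
                raise ValueError("files have different numbers of lines")
            yield tuple(value.rstrip("\n") for value in row)
    finally:
        for handle in handles:
            handle.close()


# Tiny demo with placeholder data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    ids = Path(tmp) / "ids.txt"  # hypothetical file name
    go = Path(tmp) / "go.tsv"    # hypothetical file name
    ids.write_text("P12345\nQ67890\n", encoding="utf-8")
    go.write_text("GO:0005515\tGO:0005634\nGO:0003677\n", encoding="utf-8")
    rows = list(iter_aligned(ids, go))

for protein_id, go_line in rows:
    print(protein_id, go_line.split("\t"))
```

Raising on a length mismatch is a deliberate choice here: if one file is truncated, silently stopping at the shortest file would leave the tail of the others unread with no warning.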
Comments
There is no reason for everyone to waste time computing the same things.