My problem with the Butina split for CV is that most datasets are not inherently n distinct clusters; they are some k real clusters plus a long tail of random singleton molecules. That makes some of the test folds genuinely distinct, while the other folds behave the same as a random CV. We should always check whether the Butina split actually worked.
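One way to run that check is to cluster the dataset yourself and look at the cluster-size distribution before trusting the folds. A minimal RDKit sketch, assuming Morgan fingerprints and a Tanimoto distance cutoff of 0.35 (both arbitrary choices) and assuming every SMILES parses:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from rdkit.ML.Cluster import Butina

def butina_cluster_sizes(smiles_list, cutoff=0.35):
    """Cluster a SMILES list with Butina; return cluster sizes, largest first."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]  # assumes all parse
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]
    # Butina expects the condensed lower-triangle distance matrix (1 - Tanimoto)
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    clusters = Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)
    return sorted((len(c) for c in clusters), reverse=True)

sizes = butina_cluster_sizes(smiles)  # `smiles` is your dataset
singletons = sum(1 for s in sizes if s == 1)
print(f"{len(sizes)} clusters, five largest: {sizes[:5]}, "
      f"singleton fraction: {singletons / len(sizes):.2f}")
```

A handful of big clusters followed by a long run of size-1 "clusters" is exactly the failure mode above: folds built from the singletons are random splits in disguise.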
I don't have a suggestion, but I do recommend that people check that what they claim as clusters are actually clusters and not groups of random compounds. Sometimes in a 10-fold CV with scaffold splitting, only the first few folds are groups of related scaffolds, while the rest are groups of random compounds.
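A quick diagnostic along those lines, assuming the folds are available as lists of SMILES (the `fold_smiles` variable here is hypothetical): count the distinct Bemis-Murcko scaffolds per fold. A fold whose scaffold count approaches its size is effectively a random sample.

```python
from collections import Counter
from rdkit.Chem.Scaffolds import MurckoScaffold

def fold_scaffold_report(fold_smiles):
    """Print, per fold, how many distinct Murcko scaffolds it contains."""
    for i, smiles_list in enumerate(fold_smiles):
        scaffolds = Counter(
            MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in smiles_list
        )
        print(f"fold {i}: {len(smiles_list)} molecules, "
              f"{len(scaffolds)} scaffolds, "
              f"largest scaffold group: {scaffolds.most_common(1)[0][1]}")
```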
Yes, I've begun collecting the various implementations of the scaffold split, and there's quite a variety of them. Looking across those implementations, your point holds: the scaffold split essentially resembles random sampling except on clearly clustered data.
Analogue Split creates train-test splits with a fraction (γ) of activity cliffs: molecular pairs with high similarity (above a threshold, ω) but different activity labels. Results are visualized for metrics like accuracy, precision, recall, and F1-score across γ; a rough sketch of the cliff definition follows the repo link below.
https://github.com/Manas02/analogue-split
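The actual implementation is in the repo above; purely as an illustration of the definition, here is a sketch that measures the observed cliff fraction of a given split, assuming `(smiles, label)` tuples, Morgan fingerprints, and Tanimoto similarity (all choices mine, not necessarily the repo's):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def activity_cliff_fraction(train, test, omega=0.8):
    """Fraction of test molecules forming an activity cliff with at least
    one training molecule: Tanimoto similarity above omega, but a different
    activity label. `train`/`test` are lists of (smiles, label) tuples
    (hypothetical input format)."""
    def fp(s):
        return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 2048)
    train_fps = [(fp(s), y) for s, y in train]
    n_cliff = 0
    for s, y in test:
        test_fp = fp(s)
        if any(DataStructs.TanimotoSimilarity(test_fp, tfp) > omega and ty != y
               for tfp, ty in train_fps):
            n_cliff += 1
    return n_cliff / len(test)
```

Comparing this observed fraction against the target γ is a sanity check on the split itself.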