My problem with the Butina split for CV is that most datasets are not inherently n distinct clusters; they are some k real clusters plus a long tail of random singleton molecules. That makes some of the test folds genuinely distinct, while the other folds behave the same as a random CV. We should always check whether the Butina split actually worked.
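One way to run that check is to cluster the dataset yourself and look at the cluster-size distribution before trusting the folds. A minimal RDKit sketch, assuming Morgan fingerprints and a Tanimoto distance cutoff of 0.35 (both arbitrary choices) and assuming every SMILES parses:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from rdkit.ML.Cluster import Butina

def butina_cluster_sizes(smiles_list, cutoff=0.35):
    """Cluster a SMILES list with Butina; return cluster sizes, largest first."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]  # assumes all parse
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]
    # Butina expects the condensed lower-triangle distance matrix (1 - Tanimoto)
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    clusters = Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)
    return sorted((len(c) for c in clusters), reverse=True)

sizes = butina_cluster_sizes(smiles)  # `smiles` is your dataset
singletons = sum(1 for s in sizes if s == 1)
print(f"{len(sizes)} clusters, five largest: {sizes[:5]}, "
      f"singleton fraction: {singletons / len(sizes):.2f}")
```

A handful of big clusters followed by a long run of size-1 "clusters" is exactly the failure mode above: folds built from the singletons are random splits in disguise.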
I don't have a suggestion, but I do recommend that people check that what they claim as clusters are actually clusters and not groups of random compounds. Sometimes in a 10-fold CV with scaffold splitting, only the first few folds are groups of related scaffolds, while the rest are groups of random compounds.
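A quick diagnostic along those lines, assuming the folds are available as lists of SMILES (the `fold_smiles` variable here is hypothetical): count the distinct Bemis-Murcko scaffolds per fold. A fold whose scaffold count approaches its size is effectively a random sample.

```python
from collections import Counter
from rdkit.Chem.Scaffolds import MurckoScaffold

def fold_scaffold_report(fold_smiles):
    """Print, per fold, how many distinct Murcko scaffolds it contains."""
    for i, smiles_list in enumerate(fold_smiles):
        scaffolds = Counter(
            MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in smiles_list
        )
        print(f"fold {i}: {len(smiles_list)} molecules, "
              f"{len(scaffolds)} scaffolds, "
              f"largest scaffold group: {scaffolds.most_common(1)[0][1]}")
```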
Yes, I've begun collecting the various implementations of the scaffold split, and there's quite a variety of them. Looking across those implementations, your point holds: the scaffold split essentially resembles random sampling except on clearly clustered data.
Analogue Split creates train-test splits with a fraction (γ) of activity cliffs: molecular pairs with high similarity (above a threshold, ω) but different activity labels. Results are visualized for metrics like accuracy, precision, recall, and F1-score across γ; a rough sketch of the cliff definition follows the repo link below.
https://github.com/Manas02/analogue-split
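The actual implementation is in the repo above; purely as an illustration of the definition, here is a sketch that measures the observed cliff fraction of a given split, assuming `(smiles, label)` tuples, Morgan fingerprints, and Tanimoto similarity (all choices mine, not necessarily the repo's):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def activity_cliff_fraction(train, test, omega=0.8):
    """Fraction of test molecules forming an activity cliff with at least
    one training molecule: Tanimoto similarity above omega, but a different
    activity label. `train`/`test` are lists of (smiles, label) tuples
    (hypothetical input format)."""
    def fp(s):
        return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 2048)
    train_fps = [(fp(s), y) for s, y in train]
    n_cliff = 0
    for s, y in test:
        test_fp = fp(s)
        if any(DataStructs.TanimotoSimilarity(test_fp, tfp) > omega and ty != y
               for tfp, ty in train_fps):
            n_cliff += 1
    return n_cliff / len(test)
```

Comparing this observed fraction against the target γ is a sanity check on the split itself.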