Today, the amount of data generated in fields such as engineering, the social sciences, or medicine is growing at a tremendous rate. Extracting relevant information is therefore becoming increasingly challenging, and data analysis methods such as clustering are now crucial. Clustering consists in identifying homogeneous subgroups within a set of heterogeneous items. In the context of computer-aided drug discovery, where the chemical space is estimated at 10^63 compounds, we can use it to select promising subgroups within a large chemical library, discarding the bulk of the dataset that is a priori either redundant or of no medical interest.
Non-hierarchical methods
Clustering algorithms vary widely in their efficiency, so some methods are more practical than others for libraries of a given size (Downs & Barnard, 2002). For large datasets, the most suitable are non-hierarchical algorithms, which are based on a single partition of the data. Among them, probably the most widely used is K-means (Forgy, 1965), which efficiently generates a user-defined number of clusters that are iteratively updated until an optimal classification is found. Compared with the family of nearest-neighbor algorithms, such as the one developed by Darko Butina (Butina, 1999), K-means produces groups of homogeneous size, although this sometimes comes at the cost of higher internal heterogeneity (Walters, 2019). A faster implementation of K-means, called Mini Batch K-means (Sculley, 2010), uses only a small random portion of the data to update the clusters in each iteration.
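To make this concrete, here is a minimal sketch of Mini Batch K-means applied to molecular fingerprints, assuming RDKit and scikit-learn are available; the SMILES list, fingerprint settings, and cluster count are placeholders for illustration, not the parameters used in our study.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.cluster import MiniBatchKMeans

# Placeholder SMILES; in practice this would be the full chemical library.
smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC", "c1ccncc1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# 2048-bit Morgan (ECFP4-like) fingerprints converted to a numpy matrix.
fps = np.array([AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048)
                for m in mols], dtype=np.int8)

# Mini Batch K-means updates the centroids from a small random batch of the
# data at each iteration instead of the whole dataset.
kmeans = MiniBatchKMeans(n_clusters=2, batch_size=3, random_state=42)
labels = kmeans.fit_predict(fps)
print(labels)  # cluster assignment for each molecule
```

On a real library the same two calls scale to hundreds of thousands of fingerprints, which is precisely where the batch-wise updates pay off compared with standard K-means.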
Our approach
To assess the utility of Mini Batch K-means, we generated a dataset by combining all sets of the Directory of Useful Decoys with the Specs library, whose compounds were treated as decoys. In total, the dataset contained more than 300K structures, of which just 2.2K were active compounds. Applying our proposed multistep protocol, which combines clustering with 3D hydrophobic field overlays, we were able to retrieve a significant number of hits while screening less than 5% of all compounds and, more importantly, with high chemical diversity.
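Our exact protocol is not reproduced here, but the clustering step of such a workflow can be sketched as follows: after partitioning the library, only the compound nearest to each centroid is kept as the cluster representative for the downstream 3D overlays. The random fingerprints and the cluster count below are stand-ins for illustration only.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(1000, 2048))  # stand-in for real fingerprints

kmeans = MiniBatchKMeans(n_clusters=50, batch_size=256, random_state=0)
kmeans.fit(fps)

# Index of the fingerprint closest to each of the 50 centroids: these are the
# representatives passed on to the (more expensive) 3D overlay stage.
rep_idx, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, fps)
print(f"screening {len(set(rep_idx))} of {len(fps)} compounds "
      f"({100 * len(set(rep_idx)) / len(fps):.1f}%)")
```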
Although applying clustering methods might hide interesting structures compared to an exhaustive screening of all compounds in a dataset, it also enables exploring a much larger chemical space that would otherwise be intractable. Future work should focus on finding the optimal number of clusters to generate, as well as on improving the quality of the chosen representative molecules.
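As a starting point for that search, one common (though certainly not the only) heuristic is to scan several values of k and compare the silhouette coefficient of each partition; the sketch below assumes scikit-learn and uses random stand-in fingerprints.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(2000, 512))  # stand-in for real fingerprints

for k in (10, 25, 50, 100):
    labels = MiniBatchKMeans(n_clusters=k, random_state=0).fit_predict(fps)
    # Score on a subsample to keep the pairwise-distance cost manageable.
    score = silhouette_score(fps, labels, sample_size=1000, random_state=0)
    print(f"k={k:4d}  silhouette={score:.3f}")
```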
What is your experience with clustering large chemical libraries? How do you think we should cope with these large datasets?
- Butina, D. (1999). Unsupervised data base clustering based on daylight’s fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets. Journal of Chemical Information and Computer Sciences, 39(4), 747–750. https://doi.org/10.1021/ci9803381
- Downs, G. M., & Barnard, J. M. (2002). Clustering Methods and Their Uses in Computational Chemistry. In Reviews in Computational Chemistry, Volume 18 (pp. 1–40). Hoboken, New Jersey, USA: John Wiley & Sons, Inc. https://doi.org/10.1002/0471433519.ch1
- Forgy, E. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21, 768–769.
- Sculley, D. (2010). Web-Scale K-Means Clustering. Proceedings of the 19th International Conference on World Wide Web, 1177–1178.
- Walters, W. P. (2019). K-means Clustering. Retrieved from http://practicalcheminformatics.blogspot.com/2019/01/k-means-clustering.html