Clustering methods for large molecular library screening

Fernando Martín — Thu, 01 Aug 2019 07:36:47 +0000

Today, the amount of data generated in many fields such as engineering, social sciences or medicine is suffering a tremendous scale-up. Extraction of relevant information is hence becoming increasingly challenging, and methods of data analysis such as clustering are now crucial. Clustering consists on the identification of homogeneous subgroups among a set of heterogeneous items. In the context of computer-aided drug discovery, where the chemical space is estimated to be 10⁶³ compounds, we can use it to select promising subgroups inside a large chemical library, getting rid of the bulk of the dataset with either no medical interest or redundancy a priori.

Non-hierarchical methods

Clustering algorithms vary largely on their efficiency, so some methods are more practical than others for their application on libraries of different size (Downs & Barnard, 2002). For large datasets, the most suitable are non-hierarchical algorithms, that are based in a single partition of the data. From them, probably the most widely used is K-means (Forgy, 1965), that efficiently generates a user-defined number of clusters that are iteratively updated until an optimal classification is found. Compared with the family of nearest-neighbor algorithms such as the one developed by Darko Butina (Butina, 1999), K-means produces groups with a homogeneous size, although this sometimes comes at a cost of a higher internal heterogeneity (Walters, 2019). A faster implementation of K-means, called Mini Batch K-means (Sculley, 2010) uses just a small portion of the data to update the clusters in each iteration.

Hierarchical vs non-hierarchical clustering

Our approach

To assess the utility of Mini Batch K-means, we generated a dataset by combining all sets of the Directory of Useful Decoys with the library from Specs, considered as decoys. In total, the dataset contained more than 300K structures from which just 2,2K were active compounds. Applying our proposed multistep protocol that combines clustering with 3D hydrophobic fields overlays, we were able retrieve a significant number of hits, screening less than a 5% of all compounds, and more importantly with high chemical diversity.

Combination of Mini Batch K-means with hydrophobic descriptors.

Although applying clustering methods might hide interesting structures compared to the integral screening of all compounds in a dataset, it also enables exploring a much larger chemical space that, otherwise, would be intractable. Future steps should be focused on finding the optimal number of clusters to generate, as well as increasing the quality of the chosen representative molecules.

Which is your experience with clustering large chemical libraries? How do you think we should cope with these large datasets?

Butina, D. (1999). Unsupervised data base clustering based on daylight’s fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets. Journal of Chemical Information and Computer Sciences, 39(4), 747–750. https://doi.org/10.1021/ci9803381
Downs, G. M., & Barnard, J. M. (2002). Clustering Methods and Their Uses in Computational Chemistry. In Reviews in Computational Chemistry, Volume 18 (pp. 1–40). Hoboken, New Jersey, USA: John Wiley & Sons, Inc. https://doi.org/10.1002/0471433519.ch1
Forgy, E. (1965). Cluster analysis of multivariate data : efficiency versus interpretability of classifications. Biometrics, 21, 768–769.
Sculley, D. (2010). Web-Scale K-Means Clustering. Proceedings of the 19th International Conference on World Wide Web, 1177–1178.
Walters, W. P. (2019). K-means Clustering. Retrieved from http://practicalcheminformatics.blogspot.com/2019/01/k-means-clustering.html

The post Clustering methods for large molecular library screening appeared first on Pharmacelera | Pushing the limits of computational chemistry.

New user interface for PharmScreen

Enric Gibert — Wed, 09 Aug 2017 14:45:47 +0000

Check out our new user interface for PharmScreen!

Perform virtual screening campaigns with just a few clicks and take advantage of Pharmacelera’s unique 3D hydrophobic fields!

PharmScreen will find you leads with higher chemical diversity, increasing your chances of finding original scaffolds.

The post New user interface for PharmScreen appeared first on Pharmacelera | Pushing the limits of computational chemistry.

PharmScreen field-based alignment outperforms traditional solutions

Enric Gibert — Sat, 05 Aug 2017 15:54:38 +0000

PharmScreen field-based alignment outperforms traditional solutions by taking into account relevant ligand characteristics related with ligand-receptor interaction.PharmScreen superior alignment allows finding better leads in a drug discovery project.

In this figure, we can see how two cdk2 inhibitors are aligned by PharmScreen in comparison with traditional shape-based tools.

The hydrophobic fields of the two ligands are shown as a lipophilic potential (a). PharmScreen alignments (c) emphasize the optimal hydrophilic/hydrophobic overlap while the alignment based on traditional descriptors (b) misses completely the necessary balance between hydrophobic and hydrophilic properties.

The post PharmScreen field-based alignment outperforms traditional solutions appeared first on Pharmacelera | Pushing the limits of computational chemistry.

PharmaScreen Archives - Pharmacelera | Pushing the limits of computational chemistry

Clustering methods for large molecular library screening

Non-hierarchical methods

Our approach

New user interface for PharmScreen

PharmScreen field-based alignment outperforms traditional solutions