Papers
2024
- An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models. Michael Brocidiacono, Konstantin I. Popov, and Alexander Tropsha. 2024.
Structure-based virtual screening (SBVS) is a key workflow in computational drug discovery. SBVS models are assessed by measuring the enrichment of known active molecules over decoys in retrospective screens. However, the standard formula for enrichment cannot estimate model performance on very large libraries. Additionally, current screening benchmarks cannot easily be used with machine learning (ML) models due to data leakage. We propose an improved formula for calculating virtual screening enrichment and introduce the BayesBind benchmarking set, composed of protein targets that are structurally dissimilar to those in the BigBind training set. We assess current models on this benchmark and find that none perform appreciably better than a KNN baseline.
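For context, the standard enrichment factor the abstract refers to is usually computed as the hit rate in the top-scoring fraction of a ranked library divided by the hit rate of the whole library. The sketch below shows that conventional formula (not the paper's improved metric); the function and toy data are illustrative assumptions.

```python
import numpy as np

def enrichment_factor(scores, labels, top_frac=0.01):
    """Standard enrichment factor at the top `top_frac` of a ranked library.

    scores: model scores (higher = predicted more active)
    labels: 1 for known actives, 0 for decoys
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    n_top = max(1, int(round(top_frac * len(scores))))
    order = np.argsort(-scores)                 # rank best-scoring compounds first
    hits_top = labels[order][:n_top].sum()      # actives recovered in the top slice
    hit_rate_top = hits_top / n_top
    hit_rate_all = labels.mean()                # random-screening baseline
    return hit_rate_top / hit_rate_all

# toy example: 3 actives hidden among 97 decoys
rng = np.random.default_rng(0)
labels = np.array([1] * 3 + [0] * 97)
scores = rng.normal(size=100) + 2.0 * labels    # actives score higher on average
print(enrichment_factor(scores, labels, top_frac=0.05))
```

Because the denominator is the benchmark's own active fraction, the value is tied to the size and composition of the decoy set, which is part of why such estimates are hard to extrapolate to much larger libraries.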
2023
- PLANTAIN: Diffusion-inspired Pose Score Minimization for Fast and Accurate Molecular Docking. Michael Brocidiacono, Konstantin I. Popov, David Ryan Koes, and 1 more author. 2023.
Molecular docking aims to predict the 3D pose of a small molecule in a protein binding site. Traditional docking methods predict ligand poses by minimizing a physics-inspired scoring function. Recently, a diffusion model has been proposed that iteratively refines a ligand pose. We combine these two approaches by training a pose scoring function in a diffusion-inspired manner. In our method, PLANTAIN, a neural network parameterizes a simple, very fast pose scoring function on the fly, and L-BFGS minimization is used to optimize an initially random ligand pose against it. Using rigorous benchmarking practices, we demonstrate that our method achieves state-of-the-art performance while running ten times faster than the next-best method. We release PLANTAIN publicly and hope that it improves the utility of virtual screening workflows.
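The general pattern of minimizing a pose score over ligand coordinates with L-BFGS, starting from a random pose, can be sketched as follows. The `score_pose` function and `pocket_center` here are hypothetical stand-ins; PLANTAIN's actual score is a neural network parameterized per protein/ligand pair.

```python
import numpy as np
from scipy.optimize import minimize

def score_pose(coords_flat, pocket_center):
    """Stand-in pose score (lower = better): distance from a hypothetical
    pocket center plus a soft intramolecular clash penalty."""
    coords = coords_flat.reshape(-1, 3)
    center_term = np.sum((coords - pocket_center) ** 2)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    clash_term = np.sum(np.clip(1.5 - dists, 0.0, None) ** 2)  # penalize atoms closer than 1.5 Å
    return center_term + clash_term

rng = np.random.default_rng(0)
n_atoms = 20
pocket_center = np.array([10.0, 4.0, -2.0])
init_pose = pocket_center + rng.normal(scale=5.0, size=(n_atoms, 3))  # random initial pose

result = minimize(score_pose, init_pose.ravel(), args=(pocket_center,), method="L-BFGS-B")
refined_pose = result.x.reshape(n_atoms, 3)
print("initial score:", score_pose(init_pose.ravel(), pocket_center))
print("refined score:", result.fun)
```

Swapping the stand-in score for a learned one keeps the same minimization loop, which is what makes the approach fast relative to exhaustive pose sampling.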
- Transfer learning to leverage larger datasets for improved prediction of protein stability changes. Henry Dieckhaus, Michael Brocidiacono, Nicholas Randolph, and 1 more author. 2023.
Amino acid mutations that lower a protein’s thermodynamic stability are implicated in numerous diseases, and engineered proteins with enhanced stability are important in research and medicine. Computational methods for predicting how mutations perturb protein stability are therefore of great interest. Despite recent advancements in protein design using deep learning, in silico prediction of stability changes has remained challenging, in part due to a lack of large, high-quality training datasets for model development. Here we introduce ThermoMPNN, a deep neural network trained to predict stability changes for protein point mutations given an initial structure. In doing so, we demonstrate the utility of a newly released mega-scale stability dataset for training a robust stability model. We also employ transfer learning to leverage a second, larger dataset by using learned features extracted from a deep neural network trained to predict a protein’s amino acid sequence given its three-dimensional structure. We show that our method achieves competitive performance on established benchmark datasets using a lightweight model architecture that allows for rapid, scalable predictions. Finally, we make ThermoMPNN readily available as a tool for stability prediction and design.
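The transfer-learning idea described above, a lightweight model on top of frozen features from a pretrained structure-to-sequence network, can be illustrated with a minimal sketch. The embedding size, mutant encoding, and MLP shape below are assumptions for illustration, not ThermoMPNN's actual architecture.

```python
import torch
import torch.nn as nn

class StabilityHead(nn.Module):
    """Minimal transfer-learning sketch: predict a ddG for a point mutation
    from frozen per-residue embeddings of a pretrained structure->sequence
    model. Dimensions are illustrative assumptions."""

    def __init__(self, embed_dim=128, n_amino_acids=20, hidden=64):
        super().__init__()
        self.mutant_embed = nn.Embedding(n_amino_acids, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, residue_embedding, mutant_aa_index):
        # residue_embedding: (batch, embed_dim) frozen feature at the mutated position
        # mutant_aa_index:   (batch,) index of the amino acid mutated to
        mutant_feat = self.mutant_embed(mutant_aa_index)
        x = torch.cat([residue_embedding, mutant_feat], dim=-1)
        return self.mlp(x).squeeze(-1)  # predicted ddG (kcal/mol)

# toy usage with random stand-ins for the frozen backbone features
head = StabilityHead()
frozen_features = torch.randn(8, 128)            # would come from the pretrained encoder
mutant_idx = torch.randint(0, 20, (8,))
ddg_pred = head(frozen_features, mutant_idx)
loss = nn.functional.mse_loss(ddg_pred, torch.randn(8))  # regress against measured ddG
loss.backward()
```

Keeping the backbone frozen and training only a small head is what allows the second, larger dataset to be exploited with a lightweight, fast model.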
2022
- BigBind: Learning from Nonstructural Data for Structure-Based Virtual Screening. Michael Brocidiacono, Paul Francoeur, Rishal Aggarwal, and 3 more authors. 2022. Publisher: American Chemical Society.
Deep learning methods that predict protein–ligand binding have recently been used for structure-based virtual screening. Many such models have been trained using protein–ligand complexes with known crystal structures and activities from the PDBbind data set. However, because PDBbind only includes 20K complexes, models typically fail to generalize to new targets, and model performance is on par with models trained with only ligand information. Conversely, the ChEMBL database contains a wealth of chemical activity information but includes no information about binding poses. We introduce BigBind, a data set that maps ChEMBL activity data to proteins from the CrossDocked data set. BigBind comprises 583K ligand activities and includes 3D structures of the protein binding pockets. Additionally, we augmented the data by adding an equal number of putative inactives for each target. Using this data, we developed Banana (basic neural network for binding affinity), a neural network-based model to classify active from inactive compounds, defined by a 10 μM cutoff. Our model achieved an AUC of 0.72 on BigBind’s test set, while a ligand-only model achieved an AUC of 0.59. Furthermore, Banana achieved competitive performance on the LIT-PCBA benchmark (median EF1% of 1.81) while running 16,000 times faster than molecular docking with Gnina. We suggest that Banana, as well as other models trained on this data set, will significantly improve the outcomes of prospective virtual screening tasks.
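The classification setup the abstract describes, binarizing measured activities at a 10 μM cutoff and scoring models by ROC AUC, can be sketched in a few lines. The variable names and toy values below are illustrative, not BigBind's actual data schema.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

ACTIVITY_CUTOFF_UM = 10.0  # 10 micromolar active/inactive threshold from the abstract

def label_actives(activities_um):
    """Binarize measured activities (in micromolar): active if <= 10 uM."""
    return (np.asarray(activities_um) <= ACTIVITY_CUTOFF_UM).astype(int)

# toy evaluation: model scores against binarized labels
activities_um = np.array([0.5, 3.0, 12.0, 250.0, 8.0, 40.0])
labels = label_actives(activities_um)                  # -> [1, 1, 0, 0, 1, 0]
model_scores = np.array([0.9, 0.7, 0.4, 0.1, 0.6, 0.3])
print("AUC:", roc_auc_score(labels, model_scores))     # 1.0 for this toy ranking
```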
2021
- MOLUCINATE: A Generative Model for Molecules in 3D Space. Michael Arcidiacono and David Ryan Koes. 2021.
Recent advances in machine learning have enabled generative models for both optimization and de novo generation of drug candidates with desired properties. Previous generative models have focused on producing SMILES strings or 2D molecular graphs, while attempts at producing molecules in 3D have focused on reinforcement learning (RL), distance matrices, and pure atom density grids. Here we present MOLUCINATE (MOLecUlar ConvolutIoNal generATive modEl), a novel architecture that simultaneously generates topological and 3D atom position information. We demonstrate the utility of this method by using it to optimize molecules for a desired radius of gyration. In the future, this model can be applied to more useful optimization tasks, such as binding affinity for a protein target.
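The optimization target in this demonstration, the radius of gyration, is computed directly from 3D atom coordinates. A minimal sketch is below; it is unweighted by default, since the abstract does not specify whether mass weighting was used.

```python
import numpy as np

def radius_of_gyration(coords, masses=None):
    """Radius of gyration of a molecule from its 3D atom coordinates.

    coords: (n_atoms, 3) array of positions; masses: optional per-atom weights
    (unweighted by default).
    """
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        masses = np.ones(len(coords))
    masses = np.asarray(masses, dtype=float)
    center = np.average(coords, axis=0, weights=masses)
    sq_dists = np.sum((coords - center) ** 2, axis=1)
    return np.sqrt(np.average(sq_dists, weights=masses))

# toy example: four atoms on the corners of a square with 1.5 Å sides
coords = np.array([[0, 0, 0], [1.5, 0, 0], [0, 1.5, 0], [1.5, 1.5, 0]])
print(radius_of_gyration(coords))  # ~1.06 Å
```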