Papers
2024
- An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models. Michael Brocidiacono, Konstantin I. Popov, and Alexander Tropsha. 2024.
Structure-based virtual screening (SBVS) is a key workflow in computational drug discovery. SBVS models are assessed by measuring the enrichment of known active molecules over decoys in retrospective screens. However, the standard formula for enrichment cannot estimate model performance on very large libraries. Additionally, current screening benchmarks cannot easily be used with machine learning (ML) models due to data leakage. We propose an improved formula for calculating virtual screening enrichment and introduce the BayesBind benchmarking set, composed of protein targets that are structurally dissimilar to those in the BigBind training set. We assess current models on this benchmark and find that none perform appreciably better than a KNN baseline.
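For context, the standard enrichment factor the abstract refers to is usually computed as the hit rate in the top-scoring fraction of a ranked library divided by the hit rate of the whole library. The sketch below shows that conventional formula (not the paper's improved metric); the function and toy data are illustrative assumptions.

```python
import numpy as np

def enrichment_factor(scores, labels, top_frac=0.01):
    """Standard enrichment factor at the top `top_frac` of a ranked library.

    scores: model scores (higher = predicted more active)
    labels: 1 for known actives, 0 for decoys
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    n_top = max(1, int(round(top_frac * len(scores))))
    order = np.argsort(-scores)                 # rank best-scoring compounds first
    hits_top = labels[order][:n_top].sum()      # actives recovered in the top slice
    hit_rate_top = hits_top / n_top
    hit_rate_all = labels.mean()                # random-screening baseline
    return hit_rate_top / hit_rate_all

# toy example: 3 actives hidden among 97 decoys
rng = np.random.default_rng(0)
labels = np.array([1] * 3 + [0] * 97)
scores = rng.normal(size=100) + 2.0 * labels    # actives score higher on average
print(enrichment_factor(scores, labels, top_frac=0.05))
```

Because the denominator is the benchmark's own active fraction, the value is tied to the size and composition of the decoy set, which is part of why such estimates are hard to extrapolate to much larger libraries.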
2023
- PLANTAIN: Diffusion-inspired Pose Score Minimization for Fast and Accurate Molecular Docking. Michael Brocidiacono, Konstantin I. Popov, David Ryan Koes, and 1 more author. 2023.
Molecular docking aims to predict the 3D pose of a small molecule in a protein binding site. Traditional docking methods predict ligand poses by minimizing a physics-inspired scoring function. Recently, a diffusion model has been proposed that iteratively refines a ligand pose. We combine these two approaches by training a pose scoring function in a diffusion-inspired manner. In our method, PLANTAIN, a neural network parameterizes a simple, very fast pose scoring function on the fly, and L-BFGS minimization is used to optimize an initially random ligand pose against it. Using rigorous benchmarking practices, we demonstrate that our method achieves state-of-the-art performance while running ten times faster than the next-best method. We release PLANTAIN publicly and hope that it improves the utility of virtual screening workflows.
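The general pattern of minimizing a pose score over ligand coordinates with L-BFGS, starting from a random pose, can be sketched as follows. The `score_pose` function and `pocket_center` here are hypothetical stand-ins; PLANTAIN's actual score is a neural network parameterized per protein/ligand pair.

```python
import numpy as np
from scipy.optimize import minimize

def score_pose(coords_flat, pocket_center):
    """Stand-in pose score (lower = better): distance from a hypothetical
    pocket center plus a soft intramolecular clash penalty."""
    coords = coords_flat.reshape(-1, 3)
    center_term = np.sum((coords - pocket_center) ** 2)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    clash_term = np.sum(np.clip(1.5 - dists, 0.0, None) ** 2)  # penalize atoms closer than 1.5 Å
    return center_term + clash_term

rng = np.random.default_rng(0)
n_atoms = 20
pocket_center = np.array([10.0, 4.0, -2.0])
init_pose = pocket_center + rng.normal(scale=5.0, size=(n_atoms, 3))  # random initial pose

result = minimize(score_pose, init_pose.ravel(), args=(pocket_center,), method="L-BFGS-B")
refined_pose = result.x.reshape(n_atoms, 3)
print("initial score:", score_pose(init_pose.ravel(), pocket_center))
print("refined score:", result.fun)
```

Swapping the stand-in score for a learned one keeps the same minimization loop, which is what makes the approach fast relative to exhaustive pose sampling.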
- Transfer learning to leverage larger datasets for improved prediction of protein stability changes. Henry Dieckhaus, Michael Brocidiacono, Nicholas Randolph, and 1 more author. 2023.
Amino acid mutations that lower a protein’s thermodynamic stability are implicated in numerous diseases, and engineered proteins with enhanced stability are important in research and medicine. Computational methods for predicting how mutations perturb protein stability are therefore of great interest. Despite recent advancements in protein design using deep learning, in silico prediction of stability changes has remained challenging, in part due to a lack of large, high-quality training datasets for model development. Here we introduce ThermoMPNN, a deep neural network trained to predict stability changes for protein point mutations given an initial structure. In doing so, we demonstrate the utility of a newly released mega-scale stability dataset for training a robust stability model. We also employ transfer learning to leverage a second, larger dataset by using learned features extracted from a deep neural network trained to predict a protein’s amino acid sequence given its three-dimensional structure. We show that our method achieves competitive performance on established benchmark datasets using a lightweight model architecture that allows for rapid, scalable predictions. Finally, we make ThermoMPNN readily available as a tool for stability prediction and design.
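The transfer-learning idea described above, a lightweight model on top of frozen features from a pretrained structure-to-sequence network, can be illustrated with a minimal sketch. The embedding size, mutant encoding, and MLP shape below are assumptions for illustration, not ThermoMPNN's actual architecture.

```python
import torch
import torch.nn as nn

class StabilityHead(nn.Module):
    """Minimal transfer-learning sketch: predict a ddG for a point mutation
    from frozen per-residue embeddings of a pretrained structure->sequence
    model. Dimensions are illustrative assumptions."""

    def __init__(self, embed_dim=128, n_amino_acids=20, hidden=64):
        super().__init__()
        self.mutant_embed = nn.Embedding(n_amino_acids, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, residue_embedding, mutant_aa_index):
        # residue_embedding: (batch, embed_dim) frozen feature at the mutated position
        # mutant_aa_index:   (batch,) index of the amino acid mutated to
        mutant_feat = self.mutant_embed(mutant_aa_index)
        x = torch.cat([residue_embedding, mutant_feat], dim=-1)
        return self.mlp(x).squeeze(-1)  # predicted ddG (kcal/mol)

# toy usage with random stand-ins for the frozen backbone features
head = StabilityHead()
frozen_features = torch.randn(8, 128)            # would come from the pretrained encoder
mutant_idx = torch.randint(0, 20, (8,))
ddg_pred = head(frozen_features, mutant_idx)
loss = nn.functional.mse_loss(ddg_pred, torch.randn(8))  # regress against measured ddG
loss.backward()
```

Keeping the backbone frozen and training only a small head is what allows the second, larger dataset to be exploited with a lightweight, fast model.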
2022
- BigBind: Learning from Nonstructural Data for Structure-Based Virtual Screening. Michael Brocidiacono, Paul Francoeur, Rishal Aggarwal, and 3 more authors. 2022. Publisher: American Chemical Society.
Deep learning methods that predict protein–ligand binding have recently been used for structure-based virtual screening. Many such models have been trained using protein–ligand complexes with known crystal structures and activities from the PDBbind data set. However, because PDBbind only includes 20K complexes, models typically fail to generalize to new targets, and model performance is on par with models trained with only ligand information. Conversely, the ChEMBL database contains a wealth of chemical activity information but includes no information about binding poses. We introduce BigBind, a data set that maps ChEMBL activity data to proteins from the CrossDocked data set. BigBind comprises 583K ligand activities and includes 3D structures of the protein binding pockets. Additionally, we augmented the data by adding an equal number of putative inactives for each target. Using this data, we developed Banana (basic neural network for binding affinity), a neural network-based model to classify active from inactive compounds, defined by a 10 μM cutoff. Our model achieved an AUC of 0.72 on BigBind’s test set, while a ligand-only model achieved an AUC of 0.59. Furthermore, Banana achieved competitive performance on the LIT-PCBA benchmark (median EF1% of 1.81) while running 16,000 times faster than molecular docking with Gnina. We suggest that Banana, as well as other models trained on this data set, will significantly improve the outcomes of prospective virtual screening tasks.
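The classification setup the abstract describes, binarizing measured activities at a 10 μM cutoff and scoring models by ROC AUC, can be sketched in a few lines. The variable names and toy values below are illustrative, not BigBind's actual data schema.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

ACTIVITY_CUTOFF_UM = 10.0  # 10 micromolar active/inactive threshold from the abstract

def label_actives(activities_um):
    """Binarize measured activities (in micromolar): active if <= 10 uM."""
    return (np.asarray(activities_um) <= ACTIVITY_CUTOFF_UM).astype(int)

# toy evaluation: model scores against binarized labels
activities_um = np.array([0.5, 3.0, 12.0, 250.0, 8.0, 40.0])
labels = label_actives(activities_um)                  # -> [1, 1, 0, 0, 1, 0]
model_scores = np.array([0.9, 0.7, 0.4, 0.1, 0.6, 0.3])
print("AUC:", roc_auc_score(labels, model_scores))     # 1.0 for this toy ranking
```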
2021
- MOLUCINATE: A Generative Model for Molecules in 3D Space. Michael Arcidiacono and David Ryan Koes. 2021.
Recent advances in machine learning have enabled generative models for both optimization and de novo generation of drug candidates with desired properties. Previous generative models have focused on producing SMILES strings or 2D molecular graphs, while attempts at producing molecules in 3D have focused on reinforcement learning (RL), distance matrices, and pure atom density grids. Here we present MOLUCINATE (MOLecUlar ConvolutIoNal generATive modEl), a novel architecture that simultaneously generates topological and 3D atom position information. We demonstrate the utility of this method by using it to optimize molecules for a desired radius of gyration. In the future, this model can be applied to more useful optimization tasks, such as binding affinity for a protein target.
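The optimization target in this demonstration, the radius of gyration, is computed directly from 3D atom coordinates. A minimal sketch is below; it is unweighted by default, since the abstract does not specify whether mass weighting was used.

```python
import numpy as np

def radius_of_gyration(coords, masses=None):
    """Radius of gyration of a molecule from its 3D atom coordinates.

    coords: (n_atoms, 3) array of positions; masses: optional per-atom weights
    (unweighted by default).
    """
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        masses = np.ones(len(coords))
    masses = np.asarray(masses, dtype=float)
    center = np.average(coords, axis=0, weights=masses)
    sq_dists = np.sum((coords - center) ** 2, axis=1)
    return np.sqrt(np.average(sq_dists, weights=masses))

# toy example: four atoms on the corners of a square with 1.5 Å sides
coords = np.array([[0, 0, 0], [1.5, 0, 0], [0, 1.5, 0], [1.5, 1.5, 0]])
print(radius_of_gyration(coords))  # ~1.06 Å
```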