Some things that don't work in computational drug discovery

I now have a PhD. I’ve spent four years trying various hare-brained schemes to use computers to help in early-stage drug discovery, and I’ve learned a lot about what does (and, mostly, doesn’t) work in this field.

While I have learned a lot from my time, I realize I’ve been selfish with my knowledge; the vast majority of my work (the negative results) is still stuck in my head. During my research, I tried a bunch of ideas and quickly moved on to the next thing when I wasn’t getting compelling results. On the one hand, this allowed me to try way more things than I would have otherwise. On the other hand, my process was mostly motivated by all the things that failed along the way; by keeping these to myself, I’m holding the field back.

In retrospect, I think the move would have been to publish blog posts on all the things that didn’t work. Putting out actual papers on these results would have been a ton of effort, but writing a couple paragraphs and showing a plot or table wouldn’t have been too difficult. Honestly that’s closer to the original vision of scientific papers before they became the monstrosities they are.

But alas, I didn’t even go the blog post route. So, in an attempt to atone for this research sin, here is an outline of my research journey, including the failures along the way (at least the ones I remember). What follows is adapted from a section in my thesis.

The research journey

It was clear to me coming into my PhD that there was a generalization problem in this space. Most ML models were trained with cocrystal structures, and I suspected that the large amount of binding data without known structures could help. My plan was to create a curated dataset leveraging this data and then spend the rest of the PhD developing models to exploit it. So I created the BigBind Dataset and found a way to train models without any pose information. But what I really wanted was a model that took many docked poses as input and learned to attend to the correct one. The correct pose, after all, should be the most informative one to any model. I tried many variants of this approach, and none worked. Attention was never focused on any particular pose, and adding pose information never improved the model. The resulting model was essentially a kind of lookup table matching chemical and binding site signatures: not very interesting or generalizable. Basically, what other ML models were already doing.

Why didn’t this work? One hypothesis was that the Vina poses I was using were too poor due to the static protein structure; even when I predicted multiple poses, very rarely were any accurate within 2 Å RMSD. Perhaps the approach would succeed with more diverse poses such that at least one was likely to be correct. This led me to PLANTAIN, a docking approach inspired by the recently published DiffDock. DiffDock had a compelling architecture but, like many such ML models, suffered from poor generalization to novel targets. I figured I could do better with some modifications: the model needed to be faster, and it should be trained directly to produce cross-docked rather than self-docked poses. By explicitly learning to place ligands in poses that might intersect with residues that needed to move, perhaps the algorithm could avoid the failure mode of never producing the correct pose when a residue is in the wrong position. Ultimately, PLANTAIN was not dramatically more accurate than Vina in terms of top-N accuracy; most of the improvement over Vina was in finding a better top pose, which GNINA had already achieved.
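To make the "accurate within 2 Å RMSD" criterion concrete, here is a minimal sketch of how one might check whether any pose in a set counts as correct. The function names (pose_rmsd, any_pose_correct) are made up for illustration. The sketch assumes the predicted and reference poses list the same heavy atoms in the same order, and it ignores ligand symmetry, which real evaluation tools correct for. It is not the evaluation code used for Vina or PLANTAIN.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def pose_rmsd(pred: Sequence[Coord], ref: Sequence[Coord]) -> float:
    """RMSD between two poses with identical atom ordering.

    No alignment is performed: docking poses already live in the
    protein's coordinate frame, so they are compared in place.
    Assumption: no symmetry correction is applied.
    """
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("poses must be non-empty and have the same atom count")
    sq_sum = 0.0
    for (px, py, pz), (rx, ry, rz) in zip(pred, ref):
        sq_sum += (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
    return math.sqrt(sq_sum / len(pred))


def any_pose_correct(poses: List[Sequence[Coord]],
                     ref: Sequence[Coord],
                     cutoff: float = 2.0) -> bool:
    """True if at least one pose is within `cutoff` angstroms of the reference."""
    return any(pose_rmsd(p, ref) <= cutoff for p in poses)


# Toy example: a two-atom "ligand" and two candidate poses.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
near_pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # shifted 1 angstrom in z
far_pose = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]   # shifted 3 angstroms in z

print(pose_rmsd(near_pose, reference))
print(any_pose_correct([far_pose, near_pose], reference))
```

The failure mode described above corresponds to any_pose_correct returning False for most complexes, no matter how many poses are sampled.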

Around this time, I became interested in better evaluation metrics for virtual screening. I was broadly interested in algorithms that might not be more accurate in aggregate than previous models but had good uncertainty quantification or a more general output for rejecting compounds they weren’t confident in. The idea was that the top-scoring molecules would be more likely to truly bind in a realistic, massive screen of billions of compounds. This thinking led me to develop the EF^B metric and the BayesBind benchmark. BayesBind in turn revealed how difficult proper dataset splitting is: even with the most rigorous splitting algorithm I found, a KNN (K-nearest neighbors, perhaps the simplest model imaginable) performed remarkably well, indicating persistent data leakage. Overall, my experiments with uncertainty and rejection criteria never led anywhere productive, but I am glad the metric and benchmark were created.
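For readers who haven't seen one, a KNN baseline of this kind can be sketched in a few lines. This is an illustrative version, not the exact baseline I ran. It assumes each molecule is represented as a set of fingerprint "on" bits (in practice these would come from a cheminformatics toolkit), uses Tanimoto similarity, and predicts a similarity-weighted average of the labels of the k most similar training molecules. The names tanimoto and knn_predict, and the toy data, are hypothetical.

```python
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def knn_predict(query: Set[int],
                train: List[Tuple[Set[int], float]],
                k: int = 3) -> float:
    """Similarity-weighted mean label of the k most similar training molecules."""
    if not train:
        raise ValueError("training set is empty")
    scored = sorted(
        ((tanimoto(query, fp), label) for fp, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total_sim = sum(sim for sim, _ in scored)
    if total_sim == 0.0:
        # Nothing in the training set resembles the query: plain average.
        return sum(label for _, label in scored) / len(scored)
    return sum(sim * label for sim, label in scored) / total_sim


# Toy data: (fingerprint on-bits, activity label such as a pKd). Made up.
toy_train = [
    ({1, 2, 3, 4}, 8.0),
    ({1, 2, 3, 5}, 7.5),
    ({10, 11, 12}, 4.0),
]
toy_query = {1, 2, 3, 6}  # a close analog of the first two molecules

print(knn_predict(toy_query, toy_train, k=2))
```

If something this simple scores well on a held-out split, the test molecules are probably too similar to the training molecules, which is the leakage signal described above.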

At this point I was fairly discouraged by my pure ML exploits. The increased data from BigBind did not seem to be helping as much as I had hoped. My main hypothesis was that the massive data increase was somewhat illusory: most of the data came from structure-activity relationship (SAR) campaigns, producing abundant measurements on close analogs of a single compound for a single protein. This is not very informative, especially since most modifications do very little to modulate affinity.

That is when I turned to physics-based approaches. I knew vaguely that Absolute Binding Free Energy (ABFE) calculations were quite accurate but prohibitively slow. Was there a way to use ML to speed them up? I learned some statistical mechanics, which led me to create DBFE. DBFE was a purely physics-based approach, but unfortunately it still fell into the speed-accuracy trap: it was not actually more accurate than MM/GBSA on the protein-ligand complexes I tested. However, this work gave me a critical insight: the core challenge in protein-ligand binding affinity prediction is the treatment of water. Explicit solvent is essential for accurate results.

Finally, all of this led me to the Ligand Force Matching (LFM) method. I had been thinking for some time about how to create an ABFE-quality dataset of protein-ligand binding affinities, but running full ABFE calculations would be prohibitively expensive. After considerable thought, the force-matching approach emerged as a viable alternative, and it turns out to work. LFM still has limitations: it does not yet provide a massive increase in accuracy and is computationally expensive. But the fact that it’s working, despite training models entirely on synthetic data, is really cool! I think there are many ways to improve upon this method, and I now believe that the general approach of ML + physics is the way to go.

Some lessons learned

Pretty obvious, but worth stating: run informative baselines, and run them early! I ran a KNN baseline on the BigBind dataset too late, and I ran MM/GBSA too late on the DBFE project. My results didn’t really mean much without the context of how well the baselines were doing.
Predicting protein-ligand binding affinity is much harder than predicting their cocrystal pose, and it’s dominated by the free energy of reorganizing the waters in the binding site.