AI-assisted drug discovery is all the rage these days. It seems as if new companies and partnerships are being announced every day. Is this a new phenomenon, or an evolution of what has been done in the past? What are the key challenges standing in the way of computer-assisted drug discovery?

For the last episode of season 2 of The Analytical Wavelength, we talk with two experts in drug discovery. Dr. Chris Lipinski is famous for introducing the “Lipinski Rule of 5”, the original prediction tool in drug discovery, which he developed at Pfizer. Dr. Olivier Elemento is a professor at Cornell University, where he studies AI-assisted drug discovery.

Read more about Percepta.

Read the full transcript

Chris Lipinski   00:00

Computations and calculations are great at a certain stage, but the real test is what happens in real life, what happens in the experiment. The real test of something is, does it allow you to make progress?

Charis Lam   00:21

How are predictive tools changing the way we discover drugs? And how long has that change been occurring?

Jesse Harris   00:27

When we think of predictive tools, we tend to think of relatively recent times. Terms like machine learning and artificial intelligence seem to have sprung up in the last decade, at least in the popular imagination.

Charis Lam   00:40

But our attempts to predict good drug candidates go back further than that. I’ll bet you’re familiar with one of those predictive tools: Lipinski’s rule of five.

Jesse Harris   00:50

I actually do remember that from my undergrad. So how did these first human-usable predictive tools come about?

Charis Lam   00:56

Hi, I’m Charis…

Jesse Harris   00:58

…and I’m Jesse. And we’re the hosts of the Analytical Wavelength brought to you by ACD/Labs.

Charis Lam   01:03

In this episode, we’ll discuss predictive tools in chemistry, both past and present. And we’ll start the story with none other than Chris Lipinski himself. Here with me today is a scientist who will be familiar to many of our listeners. Chris Lipinski worked for 32 years at Pfizer, where he retired as senior research fellow. Since then, he’s been consulting and advising pharmaceutical companies. He’s famous, of course, for, among many other things, Lipinski’s rule of five. Thanks for joining us today, Dr. Lipinski.

Chris Lipinski   01:34

I’m very pleased.

Charis Lam   01:37

And we’re very pleased to have you. So let’s start with our usual icebreaker question. What’s your favorite chemical?

Chris Lipinski   01:44

That’s a very easy question for me to answer. It’s a compound I made in 1971 at Pfizer. The USAN name of that compound was Tolimidone, and I’m still involved with that compound 50 years later. Tolimidone was CP-26154 when I made it at Pfizer. And then after I retired and I joined up with Melior Discovery on their scientific advisory board, I suggested that compound as the first compound for Melior’s independent drug repurposing efforts. And the people at Melior discovered that the compound had a spectrum of effects on glucose tolerance, positive effects, improving glucose tolerance. Biologically, it was acting like an insulin sensitizer. Melior people discovered that it had a completely novel mechanism. This was in about 2007, maybe. It was a lyn kinase activator, and the activity had never been reported before.

And it still is the only known lyn kinase activator. The compound was taken through phase two B and it was successful in meeting its endpoints in treatment of type two diabetes. So scientifically it could have progressed. But there was the economic analysis of the time left on intellectual property coverage and the competing compounds in the market. So it was basically an economic analysis.

And the other thing that makes that compound very unique: it is a compound that was made in the era before mechanistic screening. The first screens of that compound were actually in dogs, and it’s a very low molecular weight compound. It’s really solidly in the fragment range. And, you know, despite that, it has activity of 50 nanomolar as a lyn kinase activator. So it’s a very, very unusual profile. So anyway, that’s my fun compound and my favorite.

Charis Lam   03:56

Oh wow. That’s definitely a good one and really cool that you’ve been working with it for decades. So you’ve been very influential in the field of chemical predictions, particularly related to the rule of five. Could you provide our listeners with a brief reminder of what that particular rule is about?

Chris Lipinski   04:15

Sure. So this is a rule that came out, internally at Pfizer, came out, I came up with it in 1975. It was published, I think it was in November of ’77, I’m sorry, not ’75, ’95. Discovered in ’95 and published in ’97. So it’s based on an analysis of compounds that received the United States adopted name or a British adopted name, which is an indication that the compound’s actually achieved, were suitable to go into phase two clinical studies; so they’d gotten through phase one. And the observation was that there was, if you looked at the distribution of compounds by a series of physical chemical parameters that were known in the literature to be related to, for example, oral absorption and solubility; the distribution of those parameters followed that pattern where approximately 95% of the compounds were in a certain range, 90 and 95% were in a certain range.

And so the rule became a very, very simple one, which was that you are more likely to get a compound with solubility good enough to achieve something orally, or permeability good enough to achieve some kind of activity orally, if the molecular weight were less than 500, the number of hydrogen bond donors was five or less, the number of hydrogen bond acceptors was ten or less, and the calculated logP was five or less.
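The four cutoffs just described can be written down as a small check. What follows is a minimal sketch in Python, not the implementation used at Pfizer. The `Descriptors` class, the function name, and both example compounds are assumptions made for illustration; the aspirin values are approximate and entered by hand, and in practice the logP would come from a calculation tool.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Hypothetical container for the four rule-of-five parameters."""
    mol_weight: float      # molecular weight in daltons
    h_bond_donors: int     # number of hydrogen bond donors
    h_bond_acceptors: int  # number of hydrogen bond acceptors
    clogp: float           # calculated logP


def rule_of_five_flags(d: Descriptors) -> list[str]:
    """Return the parameters that fall outside the cutoffs stated in the
    interview: MW < 500, donors <= 5, acceptors <= 10, logP <= 5."""
    flags = []
    if d.mol_weight >= 500:
        flags.append("mol_weight")
    if d.h_bond_donors > 5:
        flags.append("h_bond_donors")
    if d.h_bond_acceptors > 10:
        flags.append("h_bond_acceptors")
    if d.clogp > 5:
        flags.append("clogp")
    return flags


# Approximate, hand-entered values for aspirin (illustrative only).
aspirin = Descriptors(mol_weight=180.2, h_bond_donors=1, h_bond_acceptors=4, clogp=1.2)
# A made-up large, lipophilic compound.
hypothetical = Descriptors(mol_weight=712.0, h_bond_donors=6, h_bond_acceptors=12, clogp=6.3)

print(rule_of_five_flags(aspirin))       # no parameters flagged
print(rule_of_five_flags(hypothetical))  # all four parameters flagged
```

Returning the list of flagged parameters, rather than a single pass/fail value, mirrors the idea of showing a chemist which parameters are a problem.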

And that rule, for the types of compounds that it was based on, which were all pre-1995 compounds, has held up very, very well with time. As for the problems that existed in 1995 on the solubility front and permeability front, the solutions for solubility came much earlier. They started to come in around the late 1990s or 2000. The much more difficult problem was enhancing permeability for difficult compounds; those solutions really have reached their fruition, I would say, in the last five years. So rule of five actually provides a good benchmark of the kinds of compounds that we really like to have in drug discovery if you want them to succeed with the least amount of effort. It is possible to get beyond-rule-of-five compounds approved, to get FDA-approved drugs. But the degree of difficulty is very much greater, and the range of compounds that are likely to be successful is smaller, but it is still possible.

So the improvement in solubility and permeability was not predicted in the rule of five, because a lot of those technologies were either immature in the ’95 to ’97 time period or, in the case of permeability, at the very most nascent stages, with very little published literature on them.

Charis Lam   07:29

Right. So before the rule of five came about, which helped scientists pick those drug candidates which might be most efficient, how were drug candidates prioritized? Did researchers rely on their intuition or something else?

Chris Lipinski   07:44

Oh, actually, I would say that there is always intuition, and the best scientists usually have intuition, but no, drug candidates were chosen based on what were very logical rules at the time. With one major exception, and that major exception was that the late 1990s was the time of high throughput screening, and there was a tremendous focus on reductionist thinking and a huge focus on just potency. And people forgot about the types of properties that the compound needs apart from potency to actually become a useful drug. And that time period was the time period where drug discovery was almost wholly dominated by pharmaceutical companies, and academic drug discovery hadn’t really come in at that point in time.

So I would say the drug companies, they knew what they were doing and they had highly well worked out processes for selecting drug candidates. But it was in the era of high throughput screening, which definitely, you know, affected the world view about what you needed for a drug candidate. And also it was a time period, the late 1990s, where in many organizations, including at Pfizer, the discovery departments would tend to be in one part of the organizational structure and then the drug metabolism and the pharmaceutical sciences were almost universally in another department, in a development department, and so that also sort of hindered the ability to take into account the more development properties and to bring them backwards into the discovery process.

Charis Lam   09:46

So obviously there were some very systematic processes in place even back then in ’95, but there’s been a lot of improvement since then as well. You were talking for example, about our understanding of drug permeability. So how do you think our understanding of what makes a good drug candidate has evolved over the last few decades?

Chris Lipinski   10:05

So I would say, you know, the reason Rule of Five came out in 1997 is the timing was perfect. Effectively my lab at Pfizer, I think, was worldwide the first laboratory that really went after the measurement of the physical chemical properties in a discovery setting. And that meant that, you know, we had to take some shortcuts. You know, when we measured solubility, we didn’t measure thermodynamic solubility, we measured kinetic solubility, but that didn’t matter because our chemists were effectively making almost entirely amorphous compounds, so, you know, the effect of the crystalline state on the solubility really doesn’t matter.

And what we really were trying to do was to influence their choice of compounds, and we were successful at that. We actually rolled the rule of five out at Pfizer internally in 1995, and it was moderately successful. But you know, medicinal chemists are a smart bunch and a lot of them would say, well yeah, this is a calculation, you know, but I don’t really believe too much in calculations. I mean there was some skepticism, a considerable degree of skepticism. So what we did is that within about a year of the time we rolled it out, we actually came up with discovery kinetic solubility assays that were capable of testing 10,000 compounds a year, which, at that time, was enough to keep up with the medicinal chemistry output. There were some other management–scientist interactions that helped this along.

So, you know, our chemistry managers knew that we had a problem with, we’d nominate a clinical candidate and the poor pharmaceutical sciences people would just throw up their hands and say, what are we supposed to do with this compound? The properties are so bad, you know? So how do you solve that problem?

Well, the first way to do it is maybe try to identify compounds, maybe by a calculation a priori, that really are likely to have, you know, those miserable properties that the pharma-sci people can’t handle. And that’s where the rule of five came in. The key decision that was made, and you have to give credit for it, was by the person who at that time was in charge of our Discovery Computational Chemistry Group, who also had control of the compound registration system. I worked with him. His name was Beryl Dominy, and he put the rule of five into our registration system. So as soon as the person registering the compound entered it into the registration system and the compound structure went in, in the background, immediately, all the rule of five parameters were calculated. And if there was a problem with one or more of the parameters, they would see that at the time of registration. So the advantage of that was that it completely eliminated any excuse such as, well, we never heard of this rule of five stuff, you know, and we don’t need to know how to do it. Well, it appeared in the registration, so they couldn’t say that.

But where the managers came in is that all the chemistry managers found out about this, and they found out they could get a printout or they could see it online, so they could actually track the people in their groups and see, you know, who are the people who are making the more reasonable compounds. I still remember quarterly meetings where, you know, a manager would ask, well, how does your compound compare with the rule of five? And then the medicinal chemists would be on the spot and they couldn’t say, well, I never heard about it, or I don’t know. But nevertheless, we still had some reluctance. And then once we rolled out the discovery solubility assay, the medicinal chemists were much more comfortable about that because, honestly, and I was a medicinal chemist for so many years, computations and calculations are great at a certain stage, but the real test is what happens in real life, what happens in the experiment. And so now you have an experimental assay, and that’s sort of the story of the acceptance of the rule of five, and actually the whole story is published. I had an essay in Annual Reports in Medicinal Chemistry. I think it was 2014. There’s a good almost a page on that and the actual real story of how we came on the rule of five. I was working on something that wasn’t even anything that I was supposed to be working on. It was just summertime, and I was working on a new statistical program from SAS that I got in, and I saw these graphic possibilities. So the story was that you can get a lot of things done just because they’re fun, and also because, you know, if you want to come up with things, you have to have time.

What kills creativity is, you know, an Outlook calendar that’s full all the time. So I had all kinds of strategies so I could carve out blocks of time where I wouldn’t be bugged by phone calls and worthless meetings and so on. I would really have a chance to think. I mean, there’s actually a lot of scholarship on this. Frequently, to come up with ideas, you can’t do it in 15 minutes. It may take 15 or 20 minutes just to calm down, and it may take you an hour and a half, you know, of uninterrupted time thinking about something.

Charis Lam   15:39

Absolutely. It’s really amazing what can come out of just playing around, satisfying scientific curiosity. And it’s great to hear that the rule of five is actually one of those things. So let’s go back to… You mentioned that there was a bit of mistrust for people who thought, oh, this is just a calculator. And also that the rule of five works for many things, but that there are exceptions. So how do you feel scientists should balance their own expertise, intuition, and desire to explore alongside the guidelines that these calculators and tools are giving them?

Chris Lipinski   16:14

I do have a definitive point on that, and I think the real test of something is, does it allow you to make progress? Does it allow you to be more efficient, or does it allow you to actually succeed? And that’s really, to me, more important than something like, oh, the correlation of calculated value versus experimental value. Right. I mean, that is important. But the more important thing is the bottom line. And I will say on the rule of five, I had tremendous internal support at Pfizer at that time from the pharmaceutical sciences people, even though the way I was doing my experimental assay broke every rule that was in the textbook at that time on how you should measure solubility, because we were doing it by detection of precipitation using the diode array U.V. machine. I mean, it doesn’t appear anywhere in a textbook. In the classical way, it’s an incorrect way of measuring solubility. But the point is that the pharmaceutical sciences people were tremendously frustrated by the quality of the compounds they were getting from the discovery people. And they welcomed anybody who could come up with anything, even if it wasn’t a classical approach, that could improve things for them so that their job would be easier, and they could turn a discovery compound into a compound whose formulation and characterization would be successful, so it could be passed on to the clinical people to go into phase one. I mean, that’s what they wanted. And so what I was doing, even though it broke the rules, made their job easier and overall made the company more successful. So, I actually think there’s nothing wrong with being skeptical about calculations and there’s nothing wrong with being skeptical about experiments.

So you should be skeptical on both sides. There are undoubtedly errors in calculations. My attitude is a lot of times the exact correspondence of the calculated value to some experimental value is maybe less important than the trend you see, because you want to move medicinal chemistry in the correct direction. So maybe you’re off by a factor of three or something, but as long as it’s a fairly consistent error that you’re off, it doesn’t matter because you’re going to be moving in the correct direction.
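The point about a consistent error can be illustrated with a toy comparison. This is a sketch only: the solubility numbers and error factors below are invented, and comparing rank orders is just one simple way to express "moving in the correct direction".

```python
def rank_order(values):
    """Return compound indices sorted from lowest to highest value."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Made-up "true" solubilities for five hypothetical compounds (arbitrary units).
experimental = [2.0, 15.0, 40.0, 8.0, 120.0]

# A calculation that is consistently off by a factor of three.
consistent = [3.0 * x for x in experimental]

# A calculation whose error changes from compound to compound (made-up factors).
factors = [3.0, 0.3, 0.2, 2.0, 0.05]
inconsistent = [f * x for f, x in zip(factors, experimental)]

# The consistent error leaves the ordering of compounds untouched.
print(rank_order(experimental) == rank_order(consistent))
# The inconsistent error reorders the compounds.
print(rank_order(experimental) == rank_order(inconsistent))
```

With the consistent factor, a chemist comparing two compounds still learns which one is better, even though every absolute value is wrong.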

And then the other thing that’s not appreciated is the error that you get in experimental physicochemical measurements. So again, in this time frame of the late 1990s, 2001-2002, I actually worked quite closely with a Pfizer manager who was a very, very good chemist who was in charge of environmental packages on our compounds, and as part of that, you know, we had to get experimental logP measurements, pretty accurate measurements, of some pretty highly lipophilic compounds, which is incredibly difficult to do and horribly expensive. I mean, and this is back in like 2000, for a compound whose logP was, let’s say, around five, to get an acceptable experimental value that we could submit to the EPA would cost us back then, you know, $20,000 to $25,000. So we had to come up with radiolabeled material, and show that the partitioning of radiolabeled material into the lipophilic phase was due to the compound itself and not some radioactive impurity. And we had to do it at two concentrations. It was really, really difficult. I still believe today that most of the experimental logP values of, let’s say, five or higher are probably wrong. And the usual pattern is that those values are far lower than the true value, which is in some ways kind of meaningless because at high logP values compounds, you know, are certainly not discrete monomeric compounds. They usually self-aggregate and are highly protein bound in biological media and so forth.

So I think having a suite of software that you’re comfortable with, that you use routinely, that covers a range of properties that are calculated, is extremely useful. The earlier it’s used, the higher the value, and then as time goes on, in late discovery and certainly by the time you get into clinical, the clinical parameters take over. So experiment always trumps calculation, and calculation has its highest value when either the experiment is difficult to do, which can easily be the case with a compound with difficult properties, or when it’s just impossible to do, for example, when you’re actually planning out what you’re doing, so the compounds don’t actually exist. You know, a competent medicinal chemist always should have a very good sense of how their compounds behave. What is responsible for the potential solubility or permeability issues? What can they do in chemistry to try to fix those?

Charis Lam   21:56

That’s really useful advice in terms of keeping the end goal in mind, using all the tools at your disposal, and always just trying to move in the right direction whenever you’re designing compounds.

Chris Lipinski   22:08

And also, you know, one of the things: after I retired, and I’m still doing a little bit of it, I did an awful lot of legal consulting work, mostly on patent litigation. One of the things we would always get from people challenging a patent was to say, well, you know, you guys are reporting activity for a compound or you’re just reporting an IC50, but you’re not bracketing it with 95% confidence limits. And that was very typical for chemistry publications, and the journals allowed it. And there’s a reason for it. A biology experiment typically is a group of animals and you’re getting some value out, but it’s essentially a single compound against a single test, maybe repeated enough times, but that’s very, very different from when you’re doing medicinal chemistry SAR, when you’re looking at the pattern of activity across compounds. And so a single point is much stronger. Okay. There might be an error there and you don’t know about it, but you’ll pick it out in the pattern of compounds because, you know, you go along and you say, wow, this compound is active. Well, why wasn’t that compound back there, you know, active? Then you go back and retest. Oh, yeah, it really isn’t active now. So the errors tend to self-correct in medicinal chemistry. I was always faced with the decision of what compound I’m going to make next, and I’m going to have to balance all of the, oftentimes, competing data on the compound. You know, so balance potency versus solubility and permeability and, you know, metabolic stability and whatever standard assays you have that would eliminate a compound; you’ve gotta pass those. So very often, in fact most of the time, the compound that gets nominated is the compound that is the compromise. It’s got the best pattern of activity against those parameters that you think are important, but against any one parameter it may not be the best compound, but it’s the best set.
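The idea of nominating "the best set" rather than the best compound on any single parameter can be sketched as follows. Everything here is an assumption for illustration: the three compounds, their normalized scores, the cutoff, and the choice to score each compound by its weakest parameter. It is not a description of how any company actually nominates candidates.

```python
# Made-up, normalized scores (0 = worst, 1 = best) for three hypothetical compounds.
compounds = {
    "A": {"potency": 0.95, "solubility": 0.20, "permeability": 0.80, "stability": 0.50},
    "B": {"potency": 0.70, "solubility": 0.75, "permeability": 0.70, "stability": 0.65},
    "C": {"potency": 0.50, "solubility": 0.90, "permeability": 0.30, "stability": 0.80},
}

MIN_ACCEPTABLE = 0.25  # hypothetical cutoff standing in for the assays a compound must pass


def passes_gates(scores):
    """A compound is eliminated if any single parameter is below the cutoff."""
    return all(value >= MIN_ACCEPTABLE for value in scores.values())


def balance_score(scores):
    """Score a compound by its weakest parameter, so one poor property drags it down."""
    return min(scores.values())


candidates = {name: s for name, s in compounds.items() if passes_gates(s)}
nominee = max(candidates, key=lambda name: balance_score(candidates[name]))
most_potent = max(compounds, key=lambda name: compounds[name]["potency"])

print(most_potent)  # the potency winner, which fails the solubility gate
print(nominee)      # the compromise compound
```

In this toy data the nominee is not the top scorer on any individual parameter, which is the pattern described above.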

Charis Lam   24:22

Definitely a lot of things to keep in mind here. You’ve shared a lot of great experience and stories with us today. What do you think is the most important advice for our listeners to take away?

Chris Lipinski   24:32

You have to understand your compound. You have to understand what its properties are, and when you’re thinking about making something or planning something out, you need to know, whatever compounds you’re planning, how you might predict that they might behave. The kind of thing that was done pre-rule of five, which was to solely focus on potency, at least for drug discovery, true drug discovery, is a total loser. And that, by the way, is the reason why the rule of five took off. From the time it was published in ’97, every year the number of references to it has, you know, increased; now it’s approaching 21,000 references, and this year it’s already more than it was last year, which is more than the year before. And there’s a reason for it, which is that it’s useful. And the other thing that’s very useful about it is it’s simple, it’s very simple; it’s been criticized as being too simple. You don’t need a computer program. You just look at the structure, and except for the logP, which you do need to have some calculation for, the other parameters you can get right from the chemical structure. It’s not a problem.

Charis Lam   25:46

Thank you very much for your contributions to medicinal chemistry, of course, but also here today on our podcast. It was a pleasure to have you.

Chris Lipinski   25:52

Okay, bye.

Charis Lam   25:55

Take care. Dr. Lipinski gave us a wonderful history of what predictions used to look like and how he came up with the rule of five, which is still famous within the industry.

Jesse Harris   26:05

Of course, chemical property prediction is still important these days, and if you want to learn more about that, you should check out the Percepta products on the ACD/Labs site. You can find a link for that in the show notes. But there are new fronts in the efforts for computer-assisted drug discovery.

Charis Lam   26:23

To hear about current advances in machine learning for drug discovery we turn to Dr. Olivier Elemento.

Here with us today we have Dr. Olivier Elemento, who is director of the Englander Institute for Precision Medicine, associate director of the Institute for Computational Biomedicine, and a professor of physiology and biophysics at Weill Cornell Medicine. He has published extensively on the use of AI and machine learning in topics related to drug discovery. And we’re really excited to have you here today. Welcome, Olivier.

Olivier Elemento   26:55

Thank you so much for having me.

Jesse Harris   26:56

Very excited to talk to you today. I want to start off, though, with our regular icebreaker question. What’s your favorite chemical?

Olivier Elemento   27:02

Well, that’s a great question. You know, I don’t know if I have one favorite chemical, but I’ve always been very impressed with natural products and drugs that come from natural products. I’m always very impressed with, you know, the amazing sort of tricks and the amazing creativity that, you know, millions of years of evolution have created. I think it’s just amazing to see these, you know, natural products. You know, we do a lot of work in cancer and a lot of cancer drugs are coming from natural products. And, you know, it’s crazy to see the complexity of these molecules that, you know, essentially have been designed by, you know, billions of years of evolution. It would be impossible, possibly, for human beings to design them from scratch. You know, they kind of evolved as a way to enable organisms to kind of, you know, defend themselves and fight each other and so on. And it’s just amazing how complex these molecules can be. And, you know, obviously, we’re interested in getting some inspiration from the complexity of these molecules.

Charis Lam   28:00

Yeah, definitely nature is an amazing designer of molecules. So moving on to our main topic of today, how did you get into this sort of intersection between A.I. and drug discovery? Did you start as a computational scientist, or did you start on the biology and science side?

Olivier Elemento   28:18

Well, thank you so much for asking this question. So my background is initially engineering. That’s how I started my studies. I was always very interested in computer science and was also interested in biology. And I was actually very interested in the application of computer science to biology. And that’s why I did a Ph.D. in computational biology. And it was very rewarding. I really enjoyed it very much.

Sort of later on, I joined Cornell and started working on a program focused on precision medicine here at Cornell. And, you know, that program involves sequencing a lot of genomes, especially cancer genomes. You know, we’ve been focusing on trying to understand what’s driving individual cancers, the mutations that make each cancer unique, and especially what we call actionable mutations, which are the mutations for which there is potentially a drug that can target the product of the mutations, and there’s a growing number of such mutations, as you know. That all being said, as we’ve been sequencing genomes, we also realized that there’s a very large number of mutations that are not actionable. You know, very often you don’t understand what these mutations are doing. There’s no drug that targets these mutations. Sometimes you actually start understanding the function of these mutations as you do experiments; you can model these mutations in a variety of different ways using mouse models or cell models, always trying to understand them better. But nonetheless, you know, there’s no existing drug that can target the product of these mutations. And for me, it was almost an a-ha moment, in the sense that we sort of started realizing that maybe there’s the potential to use A.I. as a way to create new drugs that could potentially target these mutations whose function we just did not understand or whose presence, you know, we could not act upon when we see them.

And so that’s what started a lot of the ideas in my lab in terms of using A.I. as a way to develop new drugs and discover new drugs. And there’s a lot that we’ve done in the past on this, and a lot more that we can do in the future. I think it’s a very interesting space. You know, there’s a lot of potential to use A.I. and data science as a way to transform how we develop drugs. And, you know, a lot of my focus, and a lot of our group in our institute, focuses on these kinds of problems.

Jesse Harris   30:41

Great. That’s very fascinating. Now, I want to talk specifically about how you see A.I. as solving these problems. People can say, like, oh, just throw AI at things and it’ll solve it. But so why do you think that A.I. is a good approach for solving these drug discovery problems? And that’ll be helpful for developing drugs for the targets that you’re thinking of.

Olivier Elemento   31:02

Yeah. So we are using A.I. in lots of different ways, trying to address complex, difficult problems in drug discovery using A.I. One of them, just to give you an example, and that’s something that we worked on early on, was trying to predict ahead of time if a molecule is going to be toxic in human beings. So right now, as you know, this is not a simple problem to solve, right? It’s typically addressed using mouse testing or animal testing of different kinds. And, you know, even so, once you have molecules that do not seem to be toxic in animals, as we know, many of these molecules are nonetheless toxic in humans and fail phase one trials, for example, or even fail later. So we’ve been trying to see, for example, what if we could address this kind of problem using a data science and AI approach?

A few years ago, what we did was to compile a very long list of molecules that failed in clinical trials because of toxicity reasons. We also made a very long list of molecules that did not fail; that went all the way to approval. And then we asked essentially, you know, can we learn something from these molecules? Can we learn something from the molecules that failed versus the ones that did not fail, that eventually got approved? Can we describe the molecules using different features of these molecules? Can we describe the molecules in terms of what they target in human cells? Can we describe the targets of these molecules, what they interact with in cells? Where they are expressed in different cells in the body? We made a list of features for each of the molecules, and then essentially, we used a machine learning method to predict, or to try to predict, which molecules failed and which molecules succeeded in the drug discovery process.

And as it turns out, this very simple approach, right, gave us, gave rise to a model, a predictive model that was actually quite good at predicting which molecules failed in clinical trials, in phase one trials. We validated these predictive models in different ways, but it just gives you a sense of the power of the AI to kind of learn from all the data that we have access to, to address those specific questions.

I think part of the reason why, you know, there’s a bit of hype in the field, and maybe A.I. has not delivered, if you want, as much as some people thought it could, is because sometimes you hear, like you say, that A.I. is being applied in general. What we really believe in is application of A.I. to specific questions, specific problems. You know, can we predict if a drug is going to be toxic or not? That’s very specific. There’s data that exists, there are labels, you know, that’s something that’s important. In A.I. you need to have specific things to predict. You know, we call them labels, like, it’s toxic or not toxic; a very simple label in some ways. And then you can describe the drugs using, you know, the features that I mentioned. That’s essentially a well-formed machine learning problem. We thought that maybe it could work. You know, we took some inspiration from what’s happening in other fields. For example, you may have seen this movie called Moneyball with Brad Pitt, which is coming from the book, but essentially, it’s the application of data science to predicting the extent to which baseball players would be successful in the future. That’s exactly the same approach I’ve described. You know, you take a lot of players who’ve been successful, a lot of players who haven’t, you describe them using different sorts of features, and then you try to predict the future. And it worked in baseball. We did the same thing for drugs, and it works, you know, as well. So, you know, I think there’s something to be said about this kind of approach.
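The "labels plus features" framing can be made concrete with a toy example. The interview does not say which machine learning method the lab used, so the nearest-centroid classifier below is a stand-in chosen for brevity, and every feature value and label is invented. It only shows the shape of the problem: molecules described by features, each with a known outcome, used to predict the outcome for a new molecule.

```python
from statistics import mean

# Each row: (feature vector, label). The three features are invented placeholders,
# for example scaled molecular weight, scaled logP, and how broadly the target
# is expressed across tissues (0..1). None of these values are real data.
training = [
    ([0.62, 0.58, 0.90], "failed_toxicity"),
    ([0.70, 0.61, 0.80], "failed_toxicity"),
    ([0.55, 0.65, 0.85], "failed_toxicity"),
    ([0.30, 0.21, 0.20], "approved"),
    ([0.35, 0.30, 0.10], "approved"),
    ([0.28, 0.25, 0.30], "approved"),
]


def centroids(rows):
    """Average the feature vectors that share a label."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {label: [mean(col) for col in zip(*vecs)] for label, vecs in by_label.items()}


def predict(model, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: dist(model[label]))


model = centroids(training)
print(predict(model, [0.66, 0.60, 0.75]))  # resembles the failed molecules
print(predict(model, [0.31, 0.27, 0.15]))  # resembles the approved molecules
```

A real model would need far more molecules, richer features, and validation on data held out from training, as the validation mentioned earlier implies.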

Charis Lam   34:36

Absolutely. So you talked a bit about the list of features that you need in order to build your model, and, I think one of the challenges in some fields at least, is getting the data to feed these models. So where do you look for the source of data for these features, and how long does it take you to train one of these models?

Olivier Elemento   34:53

Yeah. So in this particular problem that I mentioned, the problem of predicting toxicity, there’s a great source of data, which is the very large number of clinical trials that failed and the trials that succeeded, that sort of moved into phase three and then eventually approval. So there’s a website for this called and then, you know, once you have information about which drugs failed or not, then there are additional databases, for example, PubChem or DrugBank, that have information about the molecules and information about how to describe the molecules or to discover the targets of the molecules and so on. So there’s a lot of data that sort of exists in the public domain.

One of the challenges of the field, and that’s something that my lab is really trying to specialize in, is data integration, because I think the challenge that we see across the board is that there’s a lot of data, but it’s coming from different databases, and it’s often not connected. So our lab has made a lot of effort to try to build a machinery to be able to connect different pieces of data together, so that we can sort of address complex questions like the ones that, you know, that I discussed with you. But we’ve applied similar types of machine learning, for example, to predict what’s the target of a given molecule, right? A lot of molecules come out of phenotypic screens, as you know. You know, we know that they do something interesting, but we don’t necessarily know how they do so.

And, you know, we’ve been thinking that we could potentially use machine learning as a way to predict the target of a molecule. And the same thing, right? There are a lot of known molecules with known targets. Use that information as a way to train how to connect a molecule to its target based on the features of the molecule. And likewise, you know, the secret for this was data integration, was to realize that there are different ways to describe a molecule. You can use, for example, what happens to cells. If you treat cells with a molecule, you see that there are genes whose expression goes up and down. That’s information about the target, but it’s not the whole information. You can treat, let’s say, a hundred cell lines with the same molecule. Some cell lines will die, some cell lines, you know, will just, you know, essentially not care about the treatment. It’s like the profile of cell killing; that also has information about a target. It doesn’t tell you the target, but it’s an additional piece of information. You can have side effect information. You can have the structure of the molecule itself. In that project, what we did was to integrate all these pieces of data together. We basically say, look, each one of them gives us a little signal about what this molecule is doing. What about if we integrate all this information together? And as it turns out, we saw very clearly that the integration of these data types led us to very good performance when it comes to predicting the target of a particular molecule.
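One way to picture "each data type gives a little signal" is to score a query molecule against molecules with known targets in each data type separately, then average the scores. The Python sketch below does that with cosine similarity. It is an illustrative assumption, not the method used in the project described here, and the molecule names, targets, and profile values are all invented.

```python
from math import sqrt

DATA_TYPES = ("expression", "cell_killing", "structure")

# Hypothetical reference molecules with known targets and invented profiles.
KNOWN = {
    "known_1": {
        "target": "TARGET_X",
        "expression": [1.0, -0.5, 0.2],   # gene expression changes
        "cell_killing": [0.9, 0.1, 0.8],  # response across cell lines
        "structure": [1, 0, 1, 1],        # structural fingerprint bits
    },
    "known_2": {
        "target": "TARGET_Y",
        "expression": [-0.8, 0.6, 0.1],
        "cell_killing": [0.1, 0.9, 0.2],
        "structure": [0, 1, 0, 1],
    },
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def integrated_score(query, profile):
    """Average the per-data-type similarities into a single score."""
    return sum(cosine(query[t], profile[t]) for t in DATA_TYPES) / len(DATA_TYPES)


def predict_target(query, known):
    """Return the target of the most similar known molecule, and its score."""
    best = max(known, key=lambda name: integrated_score(query, known[name]))
    return known[best]["target"], integrated_score(query, known[best])


query = {
    "expression": [0.9, -0.4, 0.3],
    "cell_killing": [0.8, 0.2, 0.7],
    "structure": [1, 0, 1, 0],
}
target, score = predict_target(query, KNOWN)
print(target, round(score, 3))
```

No single data type names the target on its own; the averaged score is what ranks the candidates.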

So, you know, I think this is something where we learn that data integration is key. It’s not easy. Kind of like you said, you know, the challenge is to compile data that’s connected to each other. Databases are not typically connected, so we have to work to connect them. As I said, we built a platform for doing this. But once you do, I think you have a very powerful data set that you can use to address all kinds of different questions.
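As a rough illustration of what connecting unconnected databases involves, the Python sketch below joins two hypothetical record sets on a normalized drug name and reports what could not be matched. The field names and records are invented, and the platform mentioned above is not described in enough detail here to reproduce; real integration usually relies on stable identifiers rather than names.

```python
# Hypothetical records standing in for two sources that do not share a key format.
trial_outcomes = [
    {"drug_name": "Compound A", "failed_for_toxicity": True},
    {"drug_name": "compound b ", "failed_for_toxicity": False},
    {"drug_name": "Compound C", "failed_for_toxicity": False},
]
molecule_descriptors = [
    {"name": "COMPOUND A", "molecular_weight": 612.4},
    {"name": "Compound B", "molecular_weight": 298.1},
]


def normalize(name):
    """Lower-case and collapse whitespace so the same drug gets the same key."""
    return " ".join(name.lower().split())


def integrate(outcomes, descriptors):
    """Join outcomes to descriptors by normalized name; list unmatched drugs."""
    by_name = {normalize(d["name"]): d for d in descriptors}
    merged, unmatched = [], []
    for outcome in outcomes:
        key = normalize(outcome["drug_name"])
        descriptor = by_name.get(key)
        if descriptor is None:
            unmatched.append(outcome["drug_name"])
            continue
        merged.append({
            "drug": key,
            "molecular_weight": descriptor["molecular_weight"],
            "failed_for_toxicity": outcome["failed_for_toxicity"],
        })
    return merged, unmatched


merged, unmatched = integrate(trial_outcomes, molecule_descriptors)
print(merged)
print("unmatched:", unmatched)
```

Keeping the unmatched list visible matters, because records that fail to join silently shrink the training set.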

Jesse Harris   38:06

Lovely. And I love hearing all of these examples that you’ve shared from your research. I was wondering, though, if there were any other examples of things that you’ve done that were particularly exciting achievements from the application of A.I. that you could share with the audience?

Olivier Elemento   38:21

Yeah, absolutely, plenty of projects. You know, one of the projects that I think has been really exciting to me and to us here at Cornell is to be able to analyze pathology slides for interesting signals. So as you may know, every time that somebody is diagnosed with cancer, the diagnosis is coming from a piece of tissue from the cancer that’s stained with specific dyes, with the idea of being able to understand the architecture of a tissue, the morphology of cells, in the tissue that’s, you know, suspected to be cancer. And so from that kind of tissue analysis, one can make a diagnosis, the precise diagnosis about the type of cancer.

These tissue sections are very often digitized. And, you know, and so they exist now by the billions. You know, there are billions of digitized pathology slides that exist in pathology databases. There are companies now that, you know, have really enhanced the field when it comes to digital pathology. The point is that there’s a lot of data that exists.

One thing that we did, it was published a few months ago, was to try to see if we could extract information from these slides, for example, to tell us what kind of mutational process is happening within the tissue, within the cancer. We essentially compiled a very large database of pathology slides for which we also had genomic information. We had information that could tell us whether the mutational process was happening in some of these tumors and not others. And using A.I. we were able to connect the images of a tissue to the mutational process that was happening inside a tissue based on genomic information. And we were able to make this automatic connection. And now the idea is that we built a model that can just take any one of these pathology slides and is able to tell, without actually having to do genomic analysis, if the mutational process is active or not. So I think that’s a great example of how, you know, we can potentially get a lot of useful information out of pathology slides, including potentially about mutational processes, some of which are potentially targetable by drugs. You know, that could be used maybe as a way to stratify patients; find patients that have a particular profile. Maybe these patients will respond better to a particular drug.

Charis Lam   40:40

So those are some really great successes. But I also wanted to ask about the flip side a bit. What do you think are the greatest challenges right now to making A.I. a more productive part of the drug discovery process?

Olivier Elemento   40:53

Well, you know, I think part of the challenge is that A.I. is good at interpolating, it’s good at sort of learning from the labels that, you know, we have, it’s good at learning from the data that we have. It’s not always good at extrapolating. So, for example, we were discussing earlier molecules, natural products, these are really complex molecules. If you look at the drugs that we’ve designed, that human beings have designed, or, you know, that came out of different types of screens or, you know, sort of targeted screens, you know, they all kind of fall into specific scaffolds, you know, specific categories. It’s very hard for A.I. to learn from existing data and kind of think about the universe of drugs that is outside the universe that it’s learned from. So A.I. is good at interpolating, it’s not great yet at extrapolating.

And I think that’s a bit of a challenge. You know, I think this is something that we need to work on as a field, sort of understanding better how to extrapolate and not just focus on the universe on which we’re learning our A.I. models. You know, that’s, that’s the challenge to some extent.
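The interpolation versus extrapolation gap can be shown with a toy model. In the Python sketch below, a nearest-neighbour predictor is trained on a property sampled only between 0 and 5, then queried inside and outside that range. The quadratic "property" and the choice of model are assumptions made purely for illustration.

```python
def true_property(x):
    """Stand-in for a real property we would like to predict (invented)."""
    return x * x


# Training data covers only the range 0.0 to 5.0, in steps of 0.5.
train = [(i * 0.5, true_property(i * 0.5)) for i in range(11)]


def predict(x):
    """1-nearest-neighbour prediction: reuse the value of the closest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


inside, outside = 2.6, 10.0  # one query within the training range, one far beyond it
interpolation_error = abs(predict(inside) - true_property(inside))
extrapolation_error = abs(predict(outside) - true_property(outside))

print(f"error inside the training range:  {interpolation_error:.2f}")
print(f"error outside the training range: {extrapolation_error:.2f}")
```

Inside the sampled range the prediction is close to the true value; far outside it, the model can only repeat what it saw at the edge of its training data.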

The other challenge that I think we see, and that maybe was applicable to this pathology A.I. project that I mentioned, is that we’re able to make these great connections between, let’s say, images and specific processes, let’s say a mutational process, you know, and so now we can essentially apply an A.I. model to a tissue section and it tells you a patient has these mutational processes, for example.

But by and large, the A.I. models that we create, you know, can’t really sort of explain themselves very well. We have a limited ability to understand precisely, in sort of, in terminology that, you know, we can understand and communicate with each other about, you know, what’s happening within these A.I. models. And I think that’s something we need to address as a community. You know, we need to make these A.I. models more explainable. And I know that, you know, some people are working on different techniques to do this. One of them is class activation maps, right? Where basically what you’re doing is to, you know, take an image and ask the machine learning model to tell you the pixels where it’s looking in the image.

And that’s okay. You know, that’s a good starting point, but I don’t see it as a great starting point for a discussion. Right? Because if you show a bunch of pixels to a lot of human beings, it’s hard to have a discussion about what is this pixel, right? It’s not information that’s useful when it comes to, for example, a physician explaining what a machine learning model is doing.

Because, well, you know there’s a bunch of pixels that are blue and you know, that’s not going to help much in terms of interpretation. So I do think that we need to come up with better ways to interpret models. This is something that the community is working on, I think there’s a lot of work to do there. But I think this is a limitation.
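For readers unfamiliar with the class activation maps mentioned above, here is a minimal Python sketch of the core computation: the map for one class is a weighted sum of the final convolutional feature maps, using that class’s weights from the last linear layer. The tiny 3x3 feature maps and the weights are invented, and a real model would have many more channels and an upsampling step.

```python
# Hypothetical feature maps from the last convolutional layer: 2 channels, 3x3 each.
feature_maps = [
    [[0.0, 0.1, 0.0],
     [0.2, 0.9, 0.1],
     [0.0, 0.1, 0.0]],
    [[0.5, 0.0, 0.0],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.4]],
]
# Hypothetical final-layer weights for the class of interest, one per channel.
class_weights = [1.0, -0.5]


def class_activation_map(maps, weights):
    """Weighted sum of the feature maps, computed position by position."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, maps)) for c in range(cols)]
        for r in range(rows)
    ]


def hottest_pixel(cam):
    """Return the (row, col) position with the highest activation."""
    positions = ((r, c) for r in range(len(cam)) for c in range(len(cam[0])))
    return max(positions, key=lambda rc: cam[rc[0]][rc[1]])


cam = class_activation_map(feature_maps, class_weights)
for row in cam:
    print(row)
print("model is 'looking' most at:", hottest_pixel(cam))
```

The output is exactly the kind of result criticized above: a grid of numbers per pixel, which says where the model looked but not why that region matters.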

Jesse Harris   43:45

Excellent. I think that was a really interesting summary of some of the questions that A.I. is facing right now. Are there any other final thoughts that you wanted to share with us around the trajectory of A.I. or where you think things are going in the future?

Olivier Elemento   43:59

You know, I think there’s a lot of potential in the field. I always say that I really believe that the way we develop drugs, I think, is going to be completely changed using data science. There’s a lot of data that we can learn from. I think the drug industry is starting to embrace this idea that you can learn from existing data and use that data as a way to do better, sort of be faster, fail less often. It’s amazing potential. I do think, as I say, that the key is going to be to address specific problems. There’s also a lot of interesting work to be done, I think, in my view, for example, to understand where we have data gaps, because, you know, a lot of us are kind of working on, sort of, the data that already exists, you know, and I think we need to kind of map out the universe of, you know, some of the things that we measure, for example, in drug structures, understand where there are gaps. And I think there’s going to be a lot of really interesting maybe things to discover in those areas. We haven’t gone there yet at all. And I think this is important. We shouldn’t just be satisfied with the data that we have. We need to think about what data we need in the future. And I think that’s also something that is maybe not told as often as I would like it to be told.

I think there is really great potential to better understand the data universe and kind of fill the gaps once, you know, once we understand that universe.

Charis Lam   45:20

And I think that’s a really great message to take away for scientists who aren’t necessarily working in machine learning, but there is data that they can contribute to it as well.

Olivier Elemento   45:28


Charis Lam   45:29

Yeah. Well, thank you so much Olivier for joining us today. It was great to have you.

Olivier Elemento   45:32

Thank you so much for having me. It’s been a pleasure.

Jesse Harris   45:36

Thank you for joining us.

Charis Lam   45:37

All right. Take care.

Jesse Harris   45:38

From the rule of five to the latest in machine learning, this episode has taken us on a tour through the development of predictions in drug discovery.

Charis Lam   45:47

It’s been quite the journey, and it’s the perfect conclusion to our season, which has been all about predictions: how they work and how they complement scientists’ work.

Jesse Harris   45:56

We hope you’ve enjoyed following along with us on this journey.

Charis Lam   45:59

Jesse will be returning next year for season three. This is my last season hosting the Analytical Wavelength, but I’ll definitely be tuning in to future episodes. If you want to join me, remember to hit subscribe.

Jesse Harris   46:11

Keep an eye out for some bonus content that might be dropping in the feed between seasons.

Charis Lam   46:17

Thanks everyone for joining us on the Analytical Wavelength.

The Analytical Wavelength is brought to you by ACD/Labs. We create software to help scientists make the most of their analytical data by predicting molecular properties and by organizing and analyzing their experimental results. To learn more, please visit us at
