Introduction to Chemical Predictions and Machine Learning

Computational chemistry tools are becoming more powerful every day. Machine learning, artificial intelligence, big data… what does it all mean and how does it apply to the chemistry lab?

Season 2 of the Analytical Wavelength explores how predictive technology is shaping the future of science. In this episode, we explore the fundamentals of machine learning, and separate the reality from the hype.

Valery Tkachenko 00:00

The reason why there is such hype about artificial intelligence in the community now is that people think that artificial intelligence can do everything. And whatever type of chemist you are, there will be possibilities for the applications of machine learning and artificial intelligence to help you in your processes.

Charis Lam 00:29

Machine learning, artificial intelligence, big data. They are all the rage these days. But what do these terms really mean?

Jesse Harris 00:36

And what are the realities behind the myths and hype? Can machine learning actually help scientists?

Charis Lam 00:43

Hi, I’m Charis.

Jesse Harris 00:45

I’m Jesse, and we’re the hosts of the Analytical Wavelength, a podcast about chemical information and data brought to you by ACD/Labs. In this season, we’re going to be exploring the role that predictive tools play in chemical and pharmaceutical research.

Charis Lam 01:00

For our first episode, we want to provide an introduction to machine learning. What it is, what it can do and how it might be applied specifically in chemistry.

Jesse Harris 01:09

To answer these questions, I spoke with Valery Tkachenko, CEO and president of Science Data Experts, who specialize in developing machine learning tools for chemists.

Charis Lam 01:20

Let’s hear what he had to say.

Jesse Harris 01:22

Joining me today is Valery from Science Data Experts. How are you doing, Valery?

Valery Tkachenko 01:28

Pretty good. Thank you.

Jesse Harris 01:30

Good, good. Here to talk to us about an introduction to AI and machine learning in the chemical sciences. So, I want to start off with our ice breaker question. What is your favorite chemical?

Valery Tkachenko 01:44

That’s a difficult question. After some short thinking, I would say it’s a beer. Does that count as a chemical?

Jesse Harris 01:52

Ethanol counts as a chemical, you can choose ethanol.

Valery Tkachenko 01:55

Yeah, but you know, beer is better than ethanol.

Jesse Harris 01:59

Got it. Got it. I don’t think that the straight ethanol is particularly appetizing.

Jesse Harris 02:10

Good stuff. Good stuff. OK, so I wanted to start off by defining some general terms. Topics like predictions and machine learning are pretty popular right now, but there’s some confusion about what these different terms mean. And you can even see this with some experts that have some gray area in how they define some of these core terms. So how would you define the term machine learning?

Valery Tkachenko 02:34

You may be surprised, but I keep forgetting what machine learning is, and once in a while, I google it. And the variety of answers that I get back from the Google page is really striking, and I’ll try to answer as well as I can.

But it doesn’t mean that’s the final answer, because I don’t believe a final answer exists. So, machine learning is something like simulation by machines of human intelligence. Normally it is considered to be computers and systems which can simulate human intelligence.

And then there’s a whole variety of what you’re trying to do, from understanding natural language texts to images, whatever. Whatever resemblance a machine can provide, mimicking what humans can do, that’s probably the broadest way to define what machine learning is.

Jesse Harris 03:30

OK. And then how would that differ from something like statistics, which is used to do some similar things?

Valery Tkachenko 03:36

It’s another good question. And again, some time ago, I tried to find the answer on the internet, and basically everything led to some papers in pretty respectable journals like Nature and Nature Methods, where the scholars are arguing about what the difference is between machine learning and statistics. So unless we want to go into that gray area, I would say that statistics is a tool which machine learning uses to draw its results.

And then it comes to those statistical models, because all the predictions that machine learning is making are based on probabilistic or statistical models. So, let’s say, statistics is a tool.

Jesse Harris 04:25

Great. And then how do machine learning and statistics differ from artificial intelligence? That’s another term that’s thrown in there.

Valery Tkachenko 04:32

That’s a bit easier. So, where machines are trying to mimic human intelligence, that is called artificial intelligence. So that’s a big, big, big area. A subdomain of that area is machine learning. And the recent hype about machine learning is actually connected to the development in the area of neural networks.

And that’s a smaller area than machine learning. So basically, and that’s a pretty popular picture which you can often see: the large circle is artificial intelligence, then the smaller one inside is machine learning, and the smallest one inside that is deep learning. And deep learning is where all the hype is coming from these days.

Jesse Harris 05:19

Great. And then how about big data? Is that part of that same set of circles, or is that something that’s separate?

Valery Tkachenko 05:27

It’s absolutely separate, but it’s connected. Until we had access to big data, machine learning was learning slowly. So the more the merrier.

Jesse Harris 05:41

Mm-Hmm. And a lot of these concepts are very much more popular right now, you’re kind of talking a little bit about that. So like, why is it there’s so much interest in this area these days?

Valery Tkachenko 05:51

And probably we need to step back just a bit. You asked the question about big data. So what is big data? Where does big data come from? Let’s look back 30 years ago. We already had internet.

That’s about the time when we got internet started, thirty years ago. So the first decade of the internet was just to tune things up. The websites were appearing. They were sharing some data, but it was the infancy years of the internet.

Then, as technology started to mature, we got more and more data coming. And now we live in the world of big data. It means that everybody can share their knowledge, their information, their data, whether it’s assembled in the laboratory, or some life experience.

Everybody’s sharing on social media everywhere, and the proliferation of this data is huge. So why didn’t we learn? Why didn’t we use the term machine learning before? In some respects – it’s not a single answer – in some respects, there wasn’t enough data. Now, thanks to the internet, we’ve got enough data to build statistical models.

Jesse Harris 07:10

Great. So it’s a tie in between those two, the big data and the machine learning. That sounds excellent.

Valery Tkachenko 07:16

And many other things which we’ll probably talk about later.

Jesse Harris 07:20

Of course. OK, so these are all topics that are, you know, general to the discussions about machine learning. But I want to talk about them specifically in chemistry. Which of these concepts or practices relates to machine learning and artificial intelligence in chemistry?

Valery Tkachenko 07:42

Well, all of them. The reason why there is such hype about artificial intelligence in the community now is that people think that artificial intelligence can do everything. It depends on what type of chemist you are: you’re an analytical chemist, you’re a synthetic chemist…

And whatever type of chemist you are, there will be a possibility for the applications of machine learning and artificial intelligence to help you in your processes.

Jesse Harris 08:10

OK. So I think that probably a good way to understand that actually might be going through a bit of an example of how machine learning works.

Let’s imagine a company came to you and asked you and your company to design a tool to predict something like solubility of organic molecules, for example. Pretty straightforward. How would you go about creating something like this? What are the steps involved in creating a machine learning algorithm or a tool that uses machine learning?

Valery Tkachenko 08:40

Well, probably a good way is to understand how machines are doing it. Because, as I mentioned before, machines are mimicking human intelligence. So if you’re a human, if you’re a chemist, a lab chemist, you’re concerned with solubility in this case.

So you’re a chemist with years of experience of working with the chemicals and you’re looking at the formula of the chemicals. What will you see there? Benzene, for example, toluene… I don’t know. Some spirits. Are they soluble or not?

Well, if the molecule is polar, then likely it’s going to be soluble. If it’s not, then not. Basically you’re drawing conclusions, breaking down the molecule into the functional groups, and those functional groups in machine learning terms are called features or descriptors or fingerprints.

So, most of the machine learning methods – and I’m saying most because some developments in neural networks allow you to escape this common rule – work this way: first, with human help, for example, you design a set of characteristics for the molecule.

In this case, it’s a molecule. What kind of functional groups your molecule is comprised of? What are the bigger fragments, what are the atoms, what is the polarity of the molecules, etc.? So, let’s assume you have the whole set of these descriptors and you pass this set of descriptors through the learning algorithm.
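
As a rough illustration of what turning a molecule into descriptors can look like, here is a minimal Python sketch. The `describe` function, the chosen features, and the token pattern are hypothetical and deliberately crude: it assumes simple SMILES strings with no bracket atoms or explicit hydrogens, and real tools rely on dedicated cheminformatics libraries with far richer descriptor sets.

```python
import re
from collections import Counter

# Hypothetical, deliberately crude descriptor set for illustration only.
# Assumes simple SMILES strings without bracket atoms or explicit hydrogens.
ATOM_TOKEN = re.compile(r"Cl|Br|[A-Za-z]")


def describe(smiles: str) -> dict:
    """Turn a SMILES string into a small dictionary of counted features."""
    atoms = Counter(ATOM_TOKEN.findall(smiles))
    return {
        "heavy_atoms": sum(atoms.values()),
        "aromatic_carbons": atoms["c"],          # lowercase c marks aromatic carbon
        "oxygens": atoms["O"] + atoms["o"],
        "nitrogens": atoms["N"] + atoms["n"],
        "halogens": atoms["F"] + atoms["Cl"] + atoms["Br"],
    }


for name, smiles in [("ethanol", "CCO"), ("benzene", "c1ccccc1"), ("toluene", "Cc1ccccc1")]:
    print(name, describe(smiles))
```

Each molecule ends up as the same fixed set of numbers, which is the form a learning algorithm expects as input.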

What’s happening inside, in most cases, is a black box. Sometimes it’s not: for example, linear regression means that you are just calculating the coefficients which will describe your model. When it comes to neural networks, it’s absolutely a black box.
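
To make the linear regression case concrete, the sketch below computes the two coefficients of a one-descriptor model with ordinary least squares. The function name `fit_line` and the numbers are made up for illustration; the point is only that the whole model is two readable numbers.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up numbers: one hypothetical descriptor against a made-up property.
descriptor = [1.0, 2.0, 3.0, 4.0]
measured = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(descriptor, measured)
print(slope, intercept)
```

Because the model is just a slope and an intercept, you can read off directly how the descriptor influences the prediction, which is why this kind of model is not considered a black box.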

You feed something in, you get something out. How it works inside can sometimes be explained. By the way, there is a relatively new area which is called explainable AI, which we can talk about a bit later. But in most cases, machine learning models are black boxes.

So, you are given a molecule. But in reality, to train a machine learning model you need to have lots of molecules, because the more the merrier. The more data you have, the better the results.

Normally, not always. There are always exceptions too. That’s not an absolutely true statement. But basically, you’re breaking down, let’s say, one thousand, ten thousand, two hundred thousand molecules into little fragments, and you know the solubility of those molecules.

So, there is a machine with a loop. Data is coming in, you’re comparing what’s out with the desired outcome. If there is a difference, then you retrain the model and so forth. So, you pass these molecules through. Inside, something is growing, and that something is a tuned machine learning model.
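
One common way to picture this loop is gradient descent on a very small model. The sketch below is an assumption-laden toy, not a description of any particular product: the data is made up, there is a single hypothetical descriptor, and the learning rate and number of passes are arbitrary choices.

```python
# Made-up training data: one hypothetical descriptor value per molecule and a
# made-up "measured" property that happens to follow y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

weight, bias = 0.0, 0.0    # the model starts out knowing nothing
learning_rate = 0.05       # arbitrary choice for this toy example

for epoch in range(2000):
    grad_w = grad_b = 0.0
    for x, y in zip(xs, ys):
        predicted = weight * x + bias
        error = predicted - y              # compare the output with the desired outcome
        grad_w += 2 * error * x / len(xs)  # gradient of the mean squared error
        grad_b += 2 * error / len(xs)
    # Nudge the parameters in the direction that reduces the squared error.
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b

print(round(weight, 3), round(bias, 3))
```

Each pass compares predictions with the known answers and adjusts the parameters a little; after enough passes the parameters settle close to the values that generated the data.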

When the model is tuned – and tuned means that it shouldn’t be subjective, that there should be some objective criteria, like the metrics. Machine learning comes with the metrics. How good is the model? Well, the better the metrics, the better the model is.
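
The conversation does not name specific metrics. As an assumption, the sketch below uses two that are common for regression problems such as solubility prediction: root-mean-square error (RMSE) and the coefficient of determination (R²). The data is made up.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error: typical size of a prediction error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1.0 means a perfect fit."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up measured values and model predictions for four molecules.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]

print(rmse(measured, predicted), r_squared(measured, predicted))
```

In practice these numbers would be computed on molecules the model has not seen during training, so that they reflect how well it generalizes.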

So once this training stage is passed, inside that black box you have something tuned up and working. And then when you want to predict the properties of a new molecule, not necessarily water solubility or whatever, you just pass that molecule in and the same process repeats itself.

So that molecule is broken into exactly the same fragments as we used previously, which are called descriptors. And then the magic inside this black box happens and as an output, you get your desired property. As simple as that.

Jesse Harris 12:35

OK, that’s good. So a lot of it then comes down to the design of the actual learning component that is taking that training set and developing the model around that.

Valery Tkachenko 12:48

Actually, no, because it’s probably better seen when you…

Well, let’s say the most popular language right now for machine learning and artificial intelligence is Python, and in Python it’s just one line of code. So you have the inputs and you know your outputs, and you call just one method, which is called fit.

And regardless of what machine learning algorithm is inside, it can be SVM (support vector machines), it can be linear regression, it can be classification algorithms, anything, you just call fit. And the intrinsic details of how the stuff is trained inside are absolutely hidden from you.
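
The idea of one `fit` call regardless of the algorithm can be sketched with two toy models that share the same interface. Both classes here are hypothetical and written from scratch for illustration; libraries such as scikit-learn follow a similar `fit`/`predict` convention, but this is not their code.

```python
class MeanModel:
    """Hypothetical baseline: always predicts the average training value."""

    def fit(self, xs, ys):
        self.mean_ = sum(ys) / len(ys)
        return self

    def predict(self, xs):
        return [self.mean_ for _ in xs]


class ProportionalModel:
    """Hypothetical model y = k * x, with k found by least squares."""

    def fit(self, xs, ys):
        self.k_ = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return self

    def predict(self, xs):
        return [self.k_ * x for x in xs]


# Made-up inputs (one descriptor per molecule) and known outputs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

for model in (MeanModel(), ProportionalModel()):
    model.fit(xs, ys)  # the same one-line call, whatever happens inside
    print(type(model).__name__, model.predict([5.0]))
```

The calling code never needs to know how either model learns, which is exactly what makes the approach so accessible and also so easy to misuse.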

That’s why so many people are jumping into machine learning, because seemingly without any understanding of the details of what you are doing, you can produce great results.

Jesse Harris 13:43

OK, that’s a pretty interesting element of it. I didn’t realize that. So, I think that there are a bunch of misconceptions, though, around AI and ML. We’ve touched on some of this. This exists both within popular culture and within the scientific community as well. So I wanted to kind of read off a few statements or ideas and get your reaction to them, to say whether this is a myth, or this is overhyped, or that there’s some truth to this.

So, first one: interest in AI is very new and it’s just a fad.

Valery Tkachenko 14:13

Well, I guess both statements are false, because AI, artificial neural networks, which can do incredible things like self-driving cars and natural language processing, is all based on elements which are called perceptrons. And the perceptron was introduced, I believe, more than 60 years ago.

Jesse Harris 14:36

Not particularly new. That’s for sure.

Valery Tkachenko 14:40

No, not particularly new. Yeah. But the problem was that the computing power was not as great as we have now, and the algorithms which we currently have accessible to us were not as good. That’s why for the first decades, I would say four decades, of its existence, the perceptron was not producing any big results, and only about 15 years ago computers became pretty powerful.

And then the algorithm which is called backpropagation was developed. And that, and then of course big data, is what allows us to have what we have now as artificial intelligence.

Jesse Harris 15:20

Great. Now this is one that relates to things that we kind of talked about before. AI is a black box. You can’t learn anything about actual chemistry using AI.

Valery Tkachenko 15:33

That’s also not true. Yes, I am guilty myself of saying that “well, machine learning is a black box” because in many cases, you don’t understand how it works inside, especially in terms of artificial neural networks.

But in reality, in my example, when we were considering how soluble a substance is, what I said was: we are breaking it down into components. If it’s a polar molecule, then there is a good chance that the molecule will be soluble and so on and so forth.

If you’re talking about affinity of some small molecules to large protein molecules, it’s again the shape of the molecule, which can be derived from the presence or absence of the functional groups and their connectivity.

And as I said, there is a direction which is called explainable AI. Despite the fact that what’s happening inside that box is not always known, there is a direction which tries to explain exactly that.

And there are ways around it. This is a huge topic to talk about.

Jesse Harris 16:48

With large enough datasets, I can make accurate predictions about anything. The quality of the data is less important than the quantity.

Valery Tkachenko 16:57

Well, that’s absolutely not true. And there is a statement which sounds like: junk in, junk out. So it entirely depends on what kind of data you’re feeding inside your own algorithm. And the quantity obviously is not the critical issue.

It is more about diversity. And the keyword here is diverse: the more diverse data you have, the better you can expect to train your algorithm. And here, by the way, when I mentioned the metrics, that is how the quality of the model is estimated. It is automatically calculated as numbers.

One of those numbers is called the applicability domain. So let’s assume you have trained on molecules from a not very diverse set, meaning, let’s say, benzene rings, fused benzene rings, many of them in different combinations.

And then you’re trying to predict the solubility of a completely different type of molecule: linear chains. It’s quite intuitive that the model you’ve trained on aromatic rings will not work well on this aliphatic chain. That’s a pretty direct answer: the data means a lot, and junk in, junk out.
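
There are many ways to estimate an applicability domain. One of the simplest, shown here purely as an assumed illustration, is a range check: a new molecule counts as inside the domain only if each of its descriptors falls within the range seen during training. The two descriptors and all values are hypothetical.

```python
def descriptor_ranges(training_rows):
    """Record the (min, max) seen for each descriptor in the training set."""
    columns = list(zip(*training_rows))
    return [(min(col), max(col)) for col in columns]


def in_domain(row, ranges):
    """True only if every descriptor lies inside the training range."""
    return all(low <= value <= high for value, (low, high) in zip(row, ranges))


# Hypothetical descriptors: (aromatic ring count, longest aliphatic chain length).
# The training set contains only fused aromatic systems.
training = [(1, 0), (2, 0), (3, 1), (4, 0)]
ranges = descriptor_ranges(training)

print(in_domain((2, 1), ranges))  # resembles the training molecules
print(in_domain((0, 8), ranges))  # a long linear chain, unlike the training set
```

A prediction for a molecule that falls outside these ranges should be treated with caution, however good the model’s metrics looked on data similar to its training set.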

Jesse Harris 18:25

Great. And then another one that is very common, I think in popular culture as well, is that AI is going to be taking everybody’s jobs, both in society generally and in chemistry and the pharmaceutical industry. Between machine learning and robots, everything’s going to be automated sooner or later. What do you think about that?

Valery Tkachenko 18:42

I’m trying to scare my son when I’m saying, well, in another few years, machines will take a job from you, but no, well, not in the near future.

There is a line, there is a scope where the machine learning algorithms are applicable right now and in the near future, I don’t know, five, ten years. Despite all the great things that they’re able to do, there is no danger of machines taking jobs from people.

Although when you’re looking at the Boston Dynamics robots, you know, doing parkour and other things, you can think, yeah, it’s just a couple of years and I will lose mine.

But remember, chemists are scientists. A scientist is defined as a person who is a critical, analytical thinker. And this is where the gap is, probably, because machine learning can mimic regular behavior, some patterns which are perceptible from the data. But science goes beyond those things.

Jesse Harris 19:52

I would say that it’s not so much, maybe that people’s jobs are going to disappear, but people’s jobs might change, particularly in the sciences, as they become more advanced.

Valery Tkachenko 20:00

Well, that’s already happening. I mean, just look around. The same process – solubility is a simple case, but the process of new drug discovery. It’s all based on QSAR and QSAR is machine learning. It’s pure implementation.

Jesse Harris 20:17

Great. Now with that, I think we have covered a lot of great introductory material on this topic, but I wanted to ask you if there was anything else that you wanted to add here for our audience that you think is useful for them to know on the subject of machine learning and chemistry?

Valery Tkachenko 20:33

This is a pretty open question; we can talk for another few hours about it. The area is huge and the breakthroughs are happening literally every day.

I started to focus on machine learning just a few years ago. Despite the fact that I’ve been working in this domain for a while, machine learning has become my focus just two years ago. And literally every day some new paper comes out and new possibilities are being developed.

And I think it’s early, and especially for young scientists, it’s a very good area to focus on.

Jesse Harris 21:10

I’ll agree with that. Well, thank you so much Valery, it’s been lovely to chat with you and very informative. And yeah, it was great to learn from you.

Valery Tkachenko 21:18

Thank you, Jess.

Jesse Harris 21:23

Valery gave us some great examples of how machine learning might help advance chemistry and complement the results of scientific experiments.

Charis Lam 21:30

Yes, and that point about complementing experiments is, I think, very important. Machine learning isn’t going to replace scientists or scientific jobs.

Jesse Harris 21:40

Definitely, it’s better and I think more reassuring to think of it as a tool in our toolbox.

Charis Lam 21:45

And in our next episode, we’ll learn how that tool can be used to predict chemical properties and toxicity so we can reduce our reliance on animal experiments.

Jesse Harris 21:55

Remember to subscribe so you know when it goes live. This is the Analytical Wavelength. Until next time!

Charis Lam 22:00

The Analytical Wavelength is brought to you by ACD/Labs. We create software to help scientists make the most of the analytical data by predicting molecular properties, and by organizing and analyzing the experimental results. To learn more, please visit us at www.acdlabs.com.

Enjoying the show?

Subscribe to the podcast using your favourite service.

Season 2, Episode 1