Eating the Elephant of Chemical Data Digitalization

What does digitalization mean when it comes to analytical data? Who does it help? How does it enable automation, AI, and machine learning?

Feel like you have more chemical data digitalization questions than answers? Our team of experts at ACD/Labs is ready to assist with your digitalization, automation, and innovation goals. Richard Lee joins us in this episode to share his insights on how to minimize disruption, maximize efficiency, and ensure success in digital transformation.

00:00 Richard Lee

For digitalization to be effective, you need to have automation services that are working in the background to make that data usable. It then enables the user to make decisions on that data, instead of looking for data to compile together.

00:24 Sarah Srokosz

I don’t know about you, but I feel like everywhere you look today you can find someone talking about AI. AI-based tools are now available for just about every aspect of our lives—at home, in our studies, and at work.

00:37 Jesse Harris

Yeah, and I think now that people are more familiar with what AI can, or more importantly cannot, do, it seems like the possibilities are endless. While this is exciting, I think a lot of people are also feeling overwhelmed.

00:50 Sarah Srokosz

Absolutely, and for good reason. There’s a big difference between a technology being available and implementing it in your day to day life. But I think it’s similar to any new technology. There are both pros and cons, and at the end of the day, it can just feel really intimidating to get started if you aren’t an expert.

01:09 Jesse Harris

It certainly does feel like that sometimes, but the truth is you don’t really need to be an expert yourself to benefit from innovations like AI and automation. In chemistry research, you just need to have one on your side.

01:21 Sarah Srokosz

Hi, I’m Sarah.

01:23 Jesse Harris

And I’m Jesse. We’re the hosts of the Analytical Wavelength, a podcast about chemistry and chemical data brought to you by ACD/Labs.

01:31 Sarah Srokosz

Today we’re chatting with Richard Lee. As the director of Core Technologies, Richard is our resident expert when it comes to incorporating automation, data science, machine learning, and AI into chemistry research.

01:44 Jesse Harris

He recently wrote an article with the Analytical Scientist where he laid out tips for making sure your digitalization efforts are successful. We wanted to talk to him a little bit about it so that he could share that wisdom with you as well. Hello, Richard. How are you doing today?

02:19 Richard Lee

I’m doing pretty well, thanks. How are you?

02:03 Jesse Harris

Doing well, doing well.

02:04 Sarah Srokosz

So, Richard, thanks so much for joining us on the podcast. I want to start off with our question that we ask all of our guests, which is, what is your favorite chemical?

02:13 Richard Lee

Ooh, so many to choose from. I’d have to say caffeine, I think for obvious reasons. I consume it on the daily; multiple times a day. But it also has a little bit of a special place for me. During my grad days, I used it as kind of a marker for a lot of my experiments, so I literally used it almost every day during my grad school days.

02:39 Jesse Harris

I also used it almost every day during my grad school days. But yeah, I’m a big fan of the caffeine myself.

02:47 Sarah Srokosz

I knew Jesse was going to like that answer.

02:52 Jesse Harris

Okay, let’s get into things. We’re talking to you today about digitalization. What is digitalization and why is it important for chemical data to be digitalized?

03:04 Richard Lee

Oh, good question. So at least when I speak of digitalization, it refers to having data in a format that can be accessed and shared by humans, but also by machines as well. So that’s quite important. So that’s a new element that we have to take into consideration. I guess in the chemistry world, it implies the conversion of physical experiments into their digital equivalent.

So providing some sort of chemical context to the analysis or the analytical data, for example, compiling a set of analytical data to represent an impurity profile of a drug substance, or you have a set of analytical data that represents a biotransformation map for a metabolism study. Analytical data, on its own, doesn’t really provide enough context to enable decisions to be made. It requires some sort of contextualization, a framework or reference for that data.

But it’s not easy. There are a significant number of systems that handle chemistry-related data, like your ELNs and LIMS, registration systems, you have vendors that have their own software… But the problem is that this type of data is proprietary to their own system, so it’s not really accessible by other systems. So for chemical data to be digitalized, it also means that the data needs to be shareable across systems, or interoperable. But also when we talk about digitalization, there’s one aspect that’s really not spoken about a lot, which is the need for automation services. So, you know, I touched on compiling datasets together, and while, you know, the scientist or the chemist in the lab can certainly do that manually, it is labor intensive, right?

They can stitch the required information together, but it’s really not that efficient. So for digitalization to be really effective, you need to have automation services that are kind of working in the background to make that data usable. It then enables the user to make decisions on that data, right? Instead of looking for data to compile together.
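As a rough illustration of the kind of background compilation Richard describes, here is a minimal Python sketch that groups individual analytical records into per-sample "studies" so that nobody has to stitch them together by hand. The `AnalyticalRecord` class, its fields, the file paths, and the `compile_studies` function are all hypothetical names chosen for this example; they are not part of any ACD/Labs product or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyticalRecord:
    """One acquired dataset (hypothetical structure for illustration)."""
    sample_id: str   # assumed key that ties a dataset to a sample or batch
    technique: str   # e.g. "LC-MS" or "NMR"
    file_path: str   # where the data file lives


def compile_studies(records):
    """Group records by sample so related analyses are kept together."""
    studies = defaultdict(list)
    for record in records:
        studies[record.sample_id].append(record)
    return dict(studies)


records = [
    AnalyticalRecord("batch-001", "LC-MS", "raw/a.dat"),
    AnalyticalRecord("batch-002", "NMR", "raw/b.dat"),
    AnalyticalRecord("batch-001", "NMR", "raw/c.dat"),
]
studies = compile_studies(records)

# Each study now holds every technique that was run on that sample.
for sample_id, study in studies.items():
    print(sample_id, [r.technique for r in study])
```

In a real deployment this grouping would presumably run as a service whenever new data arrives, and the grouping key would depend on how the organization identifies its samples.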

05:31 Sarah Srokosz

Okay. And so just to clarify, I think you touched on it, but, you know, when scientists see their data, they’re seeing it’s digital. You know, it’s being displayed on a computer and pretty much all instruments and pieces of software nowadays use this data that I think most people would just consider digital. But the difference between that and what you’re talking about is this kind of standardization piece that makes it readable by both machines and humans.

Is that right?

06:04 Richard Lee

Yeah, a little bit, yes. Data is digital, right? So in a sense that it is encoded into a file. A PDF is a digital file, right? You see it on a computer screen, like you said, but it has no dimensionality to it. It is a static picture. You can’t do a lot with it. PowerPoint is also another case. You know, it is a digital file. It could have images of your chemistry experiments, could have spectra and chromatograms as images, and in that sense it is still digital. But data in these formats are essentially dead ends. So meaning that you can’t really do anything with the data, like you can’t query against it, you can’t access the spectra or chromatograms, right?

So these types of formats are still widely used in organizations. Microsoft applications are ubiquitous across basically every business. We use Microsoft Word, Excel, PowerPoint, and it’s easy access, and users are familiar with it. But the data that’s stored in these types of files are essentially dead for the chemist to use. They’re endpoints.

Instruments acquire data on some sort of sample, variance analysis, it represents… it’s represented as bytes and can be unpacked to be visualized in a specific application, but it is a singular data file. So it also has its advantages and disadvantages, just like PowerPoint presentations or PDFs, where you can compile data from many sources together, but into a static file or static image. Instruments or data from instruments are far more dynamic in the sense that yes, you can look through that data, it is rich, you can interrogate that data, the scientists can reanalyze or reinterpret, but again, it is a single analysis. In most cases, if not all, that data itself is locked to that proprietary data format. The data itself can be queried and is far more useful in a sense, but it’s only useful if it lives in that vendor environment. So the data is very much siloed into separate systems.

08:26 Jesse Harris

Yeah. And I imagine that really limits what you can end up doing with the data later on. Now you wrote recently about digitalization and the importance of it, and one of the points that you highlighted there as a tip was the importance of starting simple and starting small. What did you mean by that and what does that look like?

08:47 Richard Lee

Actually, when speaking to a lot of clients and organizations, those that are undertaking a digitalization effort or initiative or those that are in that process right now, it seems like more often than not, they do have a very grand overall vision of their data transition. They have general ideas of what their system should ultimately do, you know, and they will often use the latest technology or ask about the latest technology.

Certain buzzwords are used. But what I really think is missing is the task of being able to break it down into smaller data flows. You have a grand vision, but you can’t boil the ocean. So sometimes clients are stuck or paralyzed because they don’t know where or how to start. Sometimes organizations are waiting for technology that is not yet available in the management systems, and they think that by waiting, this new technology will satisfy their needs somehow. And again, that further delays their digitalization efforts.

For us, to start simple, you really want to look for data flows or workflows that would benefit from analytical data management. So that might be an obvious statement, but it’s true. So you want to look at a data flow where you know automation can fit in, pick instruments or data sources that are creating a lot of data, where you are currently doing a lot of manual processing or processes, either transferring that data, moving that data from one location to another, a lot of routine data processing or routine report generation. Look for those types of workflows, and that’s where you’ll get your most simple start to your digitalization effort.
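One way to make that selection concrete is to estimate how much manual effort each candidate workflow consumes and start with the largest. The sketch below is only an illustration of that idea: the `Workflow` class, the scoring rule (files per week times manual minutes per file), and all of the numbers are assumptions made up for this example, not figures from the episode.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    """A candidate data flow (all values below are illustrative guesses)."""
    name: str
    files_per_week: int
    manual_minutes_per_file: float

    @property
    def manual_hours_per_week(self):
        # Rough weekly manual effort: data volume times manual handling time.
        return self.files_per_week * self.manual_minutes_per_file / 60


def rank_candidates(workflows):
    """Order workflows so the most manual effort comes first."""
    return sorted(workflows, key=lambda w: w.manual_hours_per_week, reverse=True)


candidates = [
    Workflow("Routine LC-UV purity reports", 200, 6),
    Workflow("Occasional 2D NMR elucidation", 5, 45),
    Workflow("Open-access LC-MS file transfer", 400, 2),
]
ranked = rank_candidates(candidates)

for workflow in ranked:
    print(f"{workflow.name}: {workflow.manual_hours_per_week:.1f} h/week")
```

A high-volume, routine workflow tends to float to the top of such a list, which matches the advice to start where a lot of data and a lot of manual processing meet.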

10:45 Sarah Srokosz

Yeah, that sounds about right. What is it they say, the best way to eat an elephant is one bite at a time?

10:52 Richard Lee

Exactly.

10:54 Sarah Srokosz

Yeah. And I think another important thing that you wrote about in the article, especially when you are breaking down this overall goal of digitalization into smaller tasks, is that you need to consider the needs of data scientists and not just the needs of the experimental scientist. How do the needs of those two personas differ?

11:18 Richard Lee

So I guess when we talk about a scientist or chemist in the lab, you know, we often envision them acquiring data from specialized instruments, analyzing or interpreting data on expert applications, chromatograms, spectra, and any other measurement instrument. But now organizations want to leverage this data that’s being generated and glean insights to their experiments. And right now, that technology is available.

So they’re using more business intelligence type applications and learning frameworks to gain some insight to their experiments as a whole. These organizations are attempting to leverage BI or ML frameworks for predictive sciences. Or at minimum, they want to provide additional guidance to the scientists. So provide them with direction in terms of what synthetic route to go based on previous knowledge.

But these types of experiments, these types of applications, sorry, can’t use the raw data or the processed spectra or chromatograms as the chemist would. The data that’s required needs to be in a different format. And it’s often that the data is abstracted from the interpretation of the analytical data.

In addition, the abstracted data also needs to have some sort of chemical context to it. For data to be useful in these cases, data scientists, they need access to the abstracted data in a format that is amenable to BI, machine learning, or artificial intelligence applications. So for these people or for this group of data scientists, they require different tools, right, from applications and systems. As I mentioned, they need abstracted data, so they need the appropriate tools to perform data mining.

Whereas the chemist or scientist, they will be at a computer, they have a very rich interface. They can interact with chromatograms and spectra, they can interpret and zoom in on spectra, integrate peaks, etc. But data scientists’ needs are very different. So for them, because they need access to the abstracted data, we need to have tools that will enable them to pull specific information from the analytical data in a specified format. And the data needs to be curated before they can do that. So some of the tools that they require are APIs, so application programming interfaces, as an example. So these tools are designed specifically for data scientists to access data content, or you know, touching on curated data, as I mentioned. Because analytical data comes in a variety of vendors and formats, the metadata that needs to be associated with the data will be quite different.

But in order for these types of applications like BI or machine learning to consume that data, there has to be, I guess, uniformity to that metadata. So these applications need to understand that incoming data will need to be uniform, standardized and then it can be consumed. Otherwise that data itself would be kind of useless.

So we need tools in place for the data scientists to make that standardization and uniformity so that all the data that is being fed into these BI or machine learning frameworks can be used effectively.
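To make the idea of uniform metadata concrete, here is a minimal Python sketch that maps differently named vendor metadata fields onto one shared schema before the data is handed to a BI or machine learning framework. The field names, the alias table, and the `standardize` function are hypothetical; real vendor formats use their own field names, and a production mapping would need to be curated against them.

```python
# Hypothetical aliases: each canonical field lists names a vendor might use.
FIELD_ALIASES = {
    "sample_id": {"sample_id", "SampleName", "sample name"},
    "instrument": {"instrument", "InstrumentModel", "inst"},
    "acquired_at": {"acquired_at", "AcqDate", "acquisition time"},
}

# Reverse lookup from a lower-cased alias to its canonical field name.
LOOKUP = {
    alias.lower(): canonical
    for canonical, aliases in FIELD_ALIASES.items()
    for alias in aliases
}


def standardize(metadata):
    """Return (uniform, unmapped) dictionaries for one metadata record.

    Raises ValueError if any canonical field is missing, so incomplete
    records are caught before they reach downstream applications.
    """
    uniform, unmapped = {}, {}
    for key, value in metadata.items():
        canonical = LOOKUP.get(key.strip().lower())
        if canonical is None:
            unmapped[key] = value      # keep unknown fields for later curation
        else:
            uniform[canonical] = value
    missing = set(FIELD_ALIASES) - set(uniform)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return uniform, unmapped


vendor_record = {
    "SampleName": "S1",
    "inst": "QTOF",
    "AcqDate": "2024-01-01",
    "Operator": "analyst-1",
}
uniform, unmapped = standardize(vendor_record)
print(uniform)
print(unmapped)
```

Rejecting incomplete records up front is a design choice made for this sketch; an organization might instead prefer to flag them for curation and let the rest of the batch through.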

15:10 Jesse Harris

This relates to another one of the points that you made in your piece about having the future in mind when you’re setting up these systems. Was that about having that future state in mind, like the goal that you’re trying to get to? Or is that about forward compatibility or maybe a little bit of both?

15:26 Richard Lee

I think it’s a little bit of both. So when you start out on your digitalization journey and you are trying to determine how to start, you know, we touched on choosing your specific workflow or a basic workflow. Think about how the chosen workflow can cross-pollinate into others. So how can you scale up your data flow that’s already been deployed and how can we apply that to other areas?

It’s always keeping in mind how can we reuse what we already have? So you’re not starting from scratch. That’s something that we always try to encourage our, you know, customers or clients or organizations that we work with to consider. And we try and identify as many of those common elements. So then, you know, when we do scale up, because there are more instruments being used or more personnel using those types of instruments, we can scale accordingly.

16:36 Sarah Srokosz

Okay. That makes sense. And another tip that you suggested was to consider our visualization needs. I’m curious as to what you are talking about with respect to visualization needs and why is that important?

16:52 Richard Lee

When we engage with organizations, you know, we have the scientists that are going to be the consumers of that data. We have to consider the IT infrastructure, personnel, that are responsible for essentially the backend subsystem support. More often than not, the scientists, their needs are sometimes not as obvious or not as voiced during these projects, but ultimately they will be the ones using the system.

So there could be cases where, you know, the system is very elegantly designed in terms of an IT infrastructure, it’s very efficient. It does exactly what it needs, getting data from one place to another, doing all these intermediate processes. But at the end of the day, if the scientists can’t see that data or access that data, it still won’t be useful for them and that system itself won’t be used at all.

So we really have to keep in mind what are the scientists going to do and how do they want to see that data and access that data? Because they are the important ones. They are the ones that are going to be making decisions based on results that they are able to visualize, interrogate their data, reanalyze their data. So that’s something very much to keep in mind.

18:29 Jesse Harris

This actually almost reminds me of preparing food. It’s not worth gathering the ingredients and preparing things if the person that you’re serving to is… exactly. So that’s an interesting parallel.

So ACD/Labs helps out a lot of companies with their analytical data management needs, and that often involves the use of the Spectrus Platform, which is something that you’re obviously very familiar with. So how exactly does this technology help people out?

18:56 Richard Lee

Sure. So having data in a singular format is, I think, critical for any type of analytical data management solution or system. For us, being vendor agnostic allows us to consume raw process data across any of the major vendors and across a variety of analytical techniques, into a single format while we retain its full, I guess, data fidelity, meaning that it is a true one for one copy of that raw data that’s been acquired off an instrument.

So homogenization of data affords a significant number of advantages. A single system where chemists can access data on demand, so they can do this without the need of going back to the originating data. It’s all in one place; they have a single system where the data is located.

The Spectrus Platform also, and more importantly, enables the addition of chemical context to the analytical data. So not just chemical structures that can be added, which we’ve been doing for a number of years, but we can also compile that data into studies. So associating related data together and to really tell a story of that experiment. So data or these studies themselves can be queried. We have a very strong database. It can be queried through metadata, through various chemical structure queries, spectral queries and the like.

But I want to also acknowledge that we are one piece of the IT ecosystem. We handle analytical data very well and I think we do it the best in the business, but we are a piece of the visualization ecosystem in an organization, right? So no system can do it all. But for systems to be useful, they need to have the ability to integrate well with one another. So in that regard, we have several different mechanisms in place for integration into other applications or other systems, and then for others to access our data in our environment. So this is especially important for data to be used downstream in data science-related activities.

21:27 Sarah Srokosz

Okay, great. That was a lot of super helpful information. Are there any other thoughts that you want to mention for anyone considering how to level up their analytical data management and maybe how they can get started if they haven’t yet?

21:41 Richard Lee

I think you just have to start. You can’t wait. You can plan for as long as you’d like, but then you are just delaying the inevitable. You know, you really have to start somewhere. And if you are not sure how to design your system or can’t exactly pinpoint where to start or how to start, you know, you can always come to ACD/Labs and we can always help you along that journey.

22:10 Jesse Harris

Excellent. Well, thank you so much for joining us today Richard. It’s been a pleasure having you.

22:15 Richard Lee

No problem. Thanks, you guys. Great being on.

22:18 Sarah Srokosz

Well, all I have to say is it’s a good thing there are experts like Richard out there to help make AI and other technologies a reality for scientific organizations.

22:28 Jesse Harris

You can say that again. If you want to read the five tips for digitalizing analytical chemistry data that Richard shared with the Analytical Scientist, there’s a link in the show notes to the article.

22:40 Sarah Srokosz

And if you’re looking for guidance at any stage of your digitalization journey, ACD/Labs experts like Richard are always happy to help. Visit our website to learn more about our software solutions and consultation offerings to help you reach your digitalization, automation and innovation goals.

22:56 Jesse Harris

Thank you for joining us today and don’t forget to subscribe to the Analytical Wavelength so you never miss an episode.

The Analytical Wavelength is brought to you by ACD/Labs. We create software to help scientists make the most of their analytical data by predicting molecular properties and by organizing and analyzing their experimental results. To learn more, please visit us at www.acdlabs.com.

Season 3, Episode 6