June 16, 2021
By Richard Lee, Director, Core Technology
The emphasis on developing machine learning (ML) and artificial intelligence (AI) capabilities is a notable global trend, as highlighted in Gartner’s 2019 report, in which 46% of life science CIOs state that business intelligence and data analytics capabilities will receive the largest amount of new or additional investment.
The anticipated utilization of ML/AI continues to grow across a variety of industries. Many organizations are confronting the challenge of leveraging their existing data to support R&D initiatives. We at ACD/Labs have also observed growing interest in predictive analytics tools to support these initiatives. As organizations continue to conduct research activities, stakeholders are recognizing the difficulty of collecting and curating one of their most valuable assets: data. It is important that stakeholders manage the automated collection of data according to FAIR principles, while standardizing and normalizing datasets to ensure effective accessibility and utilization. However, stakeholders must also consider how these processes can anticipate ongoing ML/AI initiatives.
According to results published from a Deloitte survey in 2020, some of the biggest challenges in ML/AI projects are the lack of data in a usable form, an insufficient quantity of data, significant bias or errors in the data, and a lack of people/tools to label data. Collecting, normalizing, and appropriately labelling data is a substantial undertaking. However, with investment in appropriate data governance and curation practices at the outset, and once enough good data has been collected, implementing ML/AI has demonstrated significant return on investment.
ML/AI has proven to excel at finding insights and patterns in large datasets and can significantly help R&D organizations leverage data to streamline workflows, increase efficiency, and boost productivity.
It’s understood that ML/AI won’t help a scientist process data more quickly, but it will provide guidance so they can, for example, more efficiently explore experimental design space.
A typical large or medium-sized biotech company has synthetic chemists working on single reactions and high-throughput experiments. In one year, they may execute up to 700,000 reactions, which will generate a minimum of 700,000 LC/UV/MS datasets for characterization of those reactions. As long as the results abstracted from the LC/UV/MS data are normalized across the instrument landscape, they may be used to reduce the scope of experimental space that must be explored in the next stage of the project, and/or future projects.
If this is representative of a year’s worth of data, think about how it would grow over 5 years!
No human could go through this amount of data and derive meaningful insights on their own; it is simply too much data to review and consume without the power that computer processing provides. Today, scientists need to focus on finding tools and methods that enable the analysis of the large quantities of data they must sift through to derive meaningful insights. Application of ML and AI can help manage the much larger quantity of data that comes from several projects, over multiple sites, for additional/deeper insights. This data, if standardized and normalized, can have multiple use cases, the first and perhaps most obvious of which is supporting effective experimental design. However, effective data practices can also be leveraged during capital asset planning cycles (e.g., investments in hardware), product lifecycle assessments, and overall portfolio prioritization.
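To make the idea of normalizing results across an instrument landscape more concrete, here is a minimal Python sketch. It assumes two instrument exports that report the same quantities under different field names and units. The vendor names, field names, and units are hypothetical and are not taken from any real instrument software or ACD/Labs product; a real landscape would need one mapping per export format.

```python
from dataclasses import dataclass

# Hypothetical per-vendor mappings: which raw field holds each value, and the
# factor that converts it to the common unit (minutes, percent).
VENDOR_SCHEMAS = {
    "vendor_a": {"sample": "SampleID", "rt": "RT_min", "rt_to_min": 1.0,
                 "purity": "UV_Purity_pct", "purity_to_pct": 1.0},
    "vendor_b": {"sample": "sample_name", "rt": "ret_time_s", "rt_to_min": 1 / 60,
                 "purity": "uv_purity_frac", "purity_to_pct": 100.0},
}


@dataclass(frozen=True)
class PeakResult:
    """Common record that every instrument export is mapped onto."""
    sample_id: str
    retention_time_min: float
    uv_purity_pct: float
    source: str


def normalize(raw: dict, vendor: str) -> PeakResult:
    """Map one raw export row onto the common record, converting units."""
    schema = VENDOR_SCHEMAS[vendor]
    return PeakResult(
        # Consistent identifier formatting: trimmed and upper-cased.
        sample_id=str(raw[schema["sample"]]).strip().upper(),
        retention_time_min=round(float(raw[schema["rt"]]) * schema["rt_to_min"], 3),
        uv_purity_pct=round(float(raw[schema["purity"]]) * schema["purity_to_pct"], 2),
        source=vendor,
    )


# Two made-up rows describing comparable results in different conventions.
records = [
    normalize({"SampleID": "rxn-0001", "RT_min": "2.45", "UV_Purity_pct": "97.2"}, "vendor_a"),
    normalize({"sample_name": " RXN-0002 ", "ret_time_s": 147, "uv_purity_frac": 0.972}, "vendor_b"),
]

for record in records:
    print(record)
```

Once every result shares one schema and one set of units, records from different instruments, projects, and sites can be compared or fed to a model without per-source special cases.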
Increasing Efficiency & Boosting Productivity
While productivity and efficiency are related, they are not the same. Productivity refers to the quantity of work produced by a team, business, or individual. In contrast, efficiency refers to the resources used to produce that work. So, the more effort, time, or raw materials required to do the work, the less efficient the process. The goal of adopting ML should be to improve both productivity and efficiency.
It is clear that machine learning frameworks require consistent, good-quality data. Automation is a foundational step toward this, and it also improves efficiency. Automating data preparation (transfer and processing) not only eases the burden on scientists but also removes variability. It makes data consistent, removes the risk of errors introduced through manual transcription, and allows data to be formatted for consumption by various systems, e.g., data warehouses.
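As a rough illustration of such an automated preparation step, the Python sketch below checks each record against a small set of rules and serializes the accepted ones for a downstream system. The field names, the validation limits, and the choice of JSON Lines as the hand-off format are assumptions made for this example, not a description of any particular product or data warehouse.

```python
import json

# Hypothetical common schema; values are assumed to be numeric already.
REQUIRED_FIELDS = ("sample_id", "retention_time_min", "uv_purity_pct")


def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    purity = record.get("uv_purity_pct")
    if purity is not None and not 0 <= purity <= 100:
        problems.append(f"uv_purity_pct out of range: {purity}")
    retention = record.get("retention_time_min")
    if retention is not None and retention < 0:
        problems.append(f"retention_time_min is negative: {retention}")
    return problems


def prepare(records: list[dict]) -> tuple[str, list[tuple[dict, list[str]]]]:
    """Split records into JSON Lines text and a list of rejects with reasons."""
    accepted, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            # sort_keys keeps the serialized form identical from run to run.
            accepted.append(json.dumps(record, sort_keys=True))
    return "\n".join(accepted), rejected


# Made-up batch: one clean record, one impossible value, one missing field.
batch = [
    {"sample_id": "RXN-0001", "retention_time_min": 2.45, "uv_purity_pct": 97.2},
    {"sample_id": "RXN-0002", "retention_time_min": 2.51, "uv_purity_pct": 972.0},
    {"sample_id": "RXN-0003", "uv_purity_pct": 88.0},
]

jsonl, rejects = prepare(batch)
print(jsonl)
for record, problems in rejects:
    print(record["sample_id"], problems)
```

Rejected records are kept with their reasons rather than silently dropped, so a scientist can review them while the clean records move on unattended.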
Based on patterns and insights gathered from data, scientists can make decisions to move projects along more quickly or decide that a project is not worth pursuing. As a result, resources can be allocated or diverted to more promising projects to increase productivity.
Coupling human intelligence with intelligent applications and technology can guide scientists through their R&D journey more efficiently and effectively.
From conversations with scientists and IT groups, it is clear that organizations aspire to leverage ML/AI; however, they are faced with the challenge of gathering complete, normalized sets of data on which models can be constructed and implemented.
Over the next several years, R&D organizations will likely continue to focus on how to normalize data into a format that can be fed into an ML/AI framework. Without normalized data, ML/AI cannot be leveraged to an extent that benefits scientists and organizations.
Until pharmaceutical organizations address this data normalization problem, their machine learning aspirations will be severely limited. Organizations can leverage ACD/Labs’ experience and expertise in data science, data engineering, AI model selection, and application implementation. In addition, we are looking to engage with those who have already addressed this first step and are ready to implement ML technologies.