
Reflections on Leveraging Chemistry Data in Machine Learning

May 14, 2020
by Richard Lee, Director, Core Technology and Capabilities, ACD/Labs

As we all adapt to a new “normal” during this pandemic, most of us who are working are doing so from home, adjusting to sharing our workspaces with family members who are also “working” from home. For me, it has never been more apparent that we live in a virtual world and that we can do so efficiently, provided that we have the appropriate tools in place to access our ever-changing digital world.

For biotechnology and pharmaceutical organizations, the move to digitalization of scientific data has been a labor-intensive initiative, but the yields of this work can be great if the data can be leveraged correctly to assist scientists in their daily work. The wealth of historical scientific data within these organizations is potentially an untapped resource. Finding new ways to structure and ultimately leverage that resource in a predictive machine learning (ML) framework can have a profound impact on how scientific research, and specifically chemistry research, is performed. The purpose of these predictive frameworks is not to replace the ideas and concepts of the chemist or scientist, but rather to augment their decision-making process or help them narrow the scope of a synthetic challenge. This augmentation allows scientists to be more efficient, reducing the time it takes to move a novel compound from discovery to, ultimately, manufacturing.


Machine learning algorithms and model predictions are only as good as the input data, and therein lies the current challenge in applying the technology: curating relevant, complete data with chemical context, so that it provides a comprehensive chemical characterization in an appropriate data model.

In most organizations, although the data might be “digitized”, relevant datasets are often disassociated from each other, are stored and managed in disparate digital silos, and are only accessible via domain-specific applications. The capability to associate experimental data with its chemical provenance is therefore key to unlocking the potential of advanced machine learning platforms for chemistry organizations.

One use case for ML is chemical structure elucidation based on spectroscopic data. For ML to be applied successfully, the spectra must first be normalized and of good quality, and the interpretation of the training data must be accurate and sufficiently complete. Mass spectral data is one such type, where the fragmentation patterns must be correctly assigned to the chemical structures that produce them. Once the data is engineered, it can be applied in a pattern recognition model for de novo chemical structure elucidation. However, one must take a holistic approach to this problem, using all available data, including NMR, UV, Raman, etc.
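To make the normalization step concrete, here is a minimal sketch in Python. It is not a de novo elucidation model: it bins a mass spectrum to unit m/z, scales it to the base peak, and uses cosine similarity against a small reference library as a simple stand-in for pattern recognition. The function names, the bin width, and the peak lists are all assumptions made up for illustration; none of them come from a real instrument or product.

```python
import math
from collections import defaultdict


def normalize_spectrum(peaks, bin_width=1.0):
    """Bin (m/z, intensity) pairs and scale so the base peak equals 1.0."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        # Peaks falling in the same m/z bin are summed together.
        binned[math.floor(mz / bin_width)] += intensity
    base_peak = max(binned.values(), default=0.0)
    if base_peak <= 0:
        return {}
    return {bin_index: value / base_peak for bin_index, value in binned.items()}


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two binned spectra stored as dicts."""
    dot = sum(value * spec_b.get(bin_index, 0.0) for bin_index, value in spec_a.items())
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_peaks, library):
    """Return the best-scoring library entry name and all similarity scores."""
    query = normalize_spectrum(query_peaks)
    scores = {
        name: cosine_similarity(query, normalize_spectrum(peaks))
        for name, peaks in library.items()
    }
    return max(scores, key=scores.get), scores


# Invented peak lists, for illustration only (not measured data).
library = {
    "reference_A": [(43.0, 100.0), (58.1, 40.0), (71.2, 10.0)],
    "reference_B": [(77.0, 100.0), (105.0, 80.0)],
}
query = [(43.1, 200.0), (58.0, 85.0), (71.0, 15.0)]

name, scores = best_match(query, library)
print(name, scores)
```

Because both spectra are scaled to their base peak before comparison, the query still matches `reference_A` even though its absolute intensities are roughly twice as large. A real workflow would also need quality checks and curated fragment assignments, which this sketch leaves out.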

A broader application for predictive machine learning would include the incorporation of appropriately structured analytical datasets across the R&D lifecycle. Depending on the stage, discovery and development chemistry typically focuses on novel chemistries to generate new chemical entities, with varying starting materials, catalysts, solvents, etc., along with some optimization of reaction conditions. Reactions are evaluated using generic separation methods because of the diversity of chemical entities that the chromatographic systems must support. In contrast, process chemistry groups are more focused on reaction conditions with limited chemical variability as they scale up the processes, so operational parameters are key. For these groups, the chromatographic methods are optimized and fit-for-purpose to enhance separation between components. If chromatographic datasets are digitally associated with chemical reaction schemes and operational conditions, and such comprehensive datasets are made available to ML platforms, cross-functional teams can take advantage of the resulting data models. These models can be useful in new molecule generation and in process optimization and scale-up tasks.
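As a rough illustration of what "digitally associated" could look like, the sketch below keeps a reaction scheme, its operational conditions, and the chromatographic peaks in a single record, then flattens that record into one row of model input. The class names, field names, and every value here are hypothetical and chosen only for the example; they do not describe any particular vendor's data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChromatogramPeak:
    retention_time_min: float
    area_percent: float
    assignment: str  # e.g. "product" or "starting_material"


@dataclass
class ReactionRecord:
    """One reaction with its conditions and its analytical results together."""
    reaction_smiles: str
    solvent: str
    catalyst: str
    temperature_c: float
    peaks: List[ChromatogramPeak] = field(default_factory=list)

    def to_training_row(self):
        """Flatten the record into a dict suitable as one row of ML input."""
        product_area = sum(
            p.area_percent for p in self.peaks if p.assignment == "product"
        )
        return {
            "reaction_smiles": self.reaction_smiles,
            "solvent": self.solvent,
            "catalyst": self.catalyst,
            "temperature_c": self.temperature_c,
            "n_peaks": len(self.peaks),
            "product_area_percent": product_area,
        }


# Invented example values (an esterification written as reaction SMILES).
record = ReactionRecord(
    reaction_smiles="CC(=O)O.OCC>>CC(=O)OCC.O",
    solvent="toluene",
    catalyst="H2SO4",
    temperature_c=110.0,
    peaks=[
        ChromatogramPeak(2.4, 15.0, "starting_material"),
        ChromatogramPeak(5.1, 85.0, "product"),
    ],
)

row = record.to_training_row()
print(row)
```

The point of the design is simply that the chemistry and the analytics travel together: once each reaction is stored this way, building a training table is a matter of calling `to_training_row()` over the historical records rather than joining exports from separate applications.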

The two examples above reflect how biotechnology organizations can use their historical data to address real-world chemistry problems. With a digital platform in place, where chemistry processes and their related analytical data are associated with each other, constructing input data for machine learning and predictive analytics can be simplified, thus reducing the time for novel chemical entities to make it from benchtop to bedside.

