Shining a Light on Dark Data

February 12, 2026

by Sanji Bhal, Director, Marketing & Communications, ACD/Labs

Editor’s note: This post was originally published following a 2019 presentation and has been updated to reflect subsequent developments.

Pfizer Digitalize Analytical Knowledge to Improve Data Access & Reuse

Analytical data management is no longer a “nice-to-have.” In 2019 I had the pleasure of hosting a webinar with a team from Pfizer. Vijay Bulusu, Pankaj Aggarwal, and David Foley spoke about how Pfizer’s Scientific Data Cloud (SDC) is helping scientists access information quickly and easily to generate insights from data, to respond to regulatory inquiries in a timely manner, and to help accelerate the product development pipeline. Spectrus technology is central to the handling of analytical data within the SDC. In their presentation “Creating a Scientific Information Library Using ACD/Spectrus”, the team discussed:

Opportunities, regulatory drivers, and challenges that led to Pfizer reimagining scientific data management
Goals for the Pfizer Scientific Data Cloud (SDC)
Why Pfizer chose Spectrus for analytical chemistry data handling
Benefits to Pfizer of Spectrus-enabled scientific data management

Why Pfizer Wanted to Reimagine their Scientific Data Management

The Dark Data Problem

Large enterprises face many data-related challenges around how employees find data relevant to their work, especially data generated by others. They:

(a) may not know whether the data exists in the first place, and/or where it’s kept,

(b) may not have access to the data repositories, and/or may not have training on how to use the data.

“In these situations we resort to the ‘sneakernet’. Colleagues looking for information either pick up the phone or send an email or message to other colleagues who they think may have better access to that data or information.”
– Vijay Bulusu

A poll of the webinar audience showed that analytical data is stored in many different systems.

“We had the same issues and challenges when it comes to finding data…it’s often stored on individual computers or hard drives. We had the dark data challenge,” said Vijay.

Gartner coined the term “dark data” to describe the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes. [1]

Analytical Data Management—An Organizational Challenge

When the SDC project began, Pfizer scientists were still exchanging documents and files manually. Data stayed in department, group or site silos. Some information flowed forwards—from research to development to manufacturing—but little information flowed backwards.

A poll of the audience revealed that the vast majority believe that some or major improvements are necessary for analytical data management in their organizations.

Asked to rank the pain of analytical data management at Pfizer at the outset of the SDC project, Vijay countered that the question is very context-specific.

“At an individual level, analytical data management is not painful because you know where you store your data, and how to access that same data. It gets more challenging at an enterprise level because it’s harder to find and reuse data generated by others.”
– Vijay Bulusu

Data scattered geographically & throughout IT systems

Analytical groups at Pfizer, as with the majority of large R&D organizations, are geographically scattered. The Structure Elucidation Group (SEG), which serves global development and clinical manufacturing, consists of 11 NMR and mass spectroscopists located across 2 sites: Groton (CT) in the US, and Sandwich in the UK. “Between us we probably elucidate structure for over 600 compounds each year. We have the challenge of these two groups being in very separate locations and time zones,” said Dave. With chromatographic data the problem is even greater, with many more scientists collecting thousands of chromatograms in the US and UK. Not only is the data scattered, but projects also change hands.

“Oftentimes projects might be in early stage development in the US, and then move to our UK site for late stage development, and vice versa. Efficient transfer of data between both sites is becoming more and more important,” added Pankaj.

Vijay also observed that, like most large R&D organizations, Pfizer has analytical instruments from a variety of vendors, each with their own formats and data types, making comparison and interoperability difficult and time-consuming.

How much of a problem was scattered data, and what are some of the consequences?

“Oftentimes, we would find that it was easier to re-record the sample and repeat the whole analysis, rather than trying to find that original piece of raw data across multiple systems,” said Dave. “At the outset of the project we knew we were missing opportunities because the value-added information was very disparate—scattered on people’s laptops, on analytical instruments, and in people’s brains.”

Managing data for regulatory expectations, data integrity & quality

Regulatory agency expectations, data integrity, and data quality were all drivers to enhance data management.

“Over the years,” shared Vijay, “we have seen an uptick in requests for original raw data files from regulatory agencies. These raw data files were spread across multiple systems and repositories which made speedy access challenging. As an additional stressor, there was growing concern about product quality if data could not be located quickly during regulatory inspections. Data integrity was also becoming a more prevalent issue.”

Pfizer also wanted to focus on data quality. Vijay stated, “It’s not enough to store data and make it accessible. We also wanted to make sure that if I generate some data, and my colleague finds it, six months later, or six years later, the context around how and why the data was collected and captured also needs to be clear.”

Accelerating innovation with better data management

The pharmaceutical industry wants to speed up innovation and fill pipelines with new therapies. To manage fast-moving and simultaneous projects, a cloud-based data management and analysis system was expected to help keep project progress on track with efficient access to data.

Those organizational goals impact every individual. “Everyone is super busy; a lot of programs are being accelerated,” agreed Dave, “trying to make time and finding efficiencies is very important. One of the main goals was to eliminate duplication of structure assignment work—that is literally a waste of time.”

Seeking insights, not just data management

“People were trying to get away from the burden of managing data—tagging data, metadata management, etc.,” said Vijay. “Users were saying, ‘give me access to the data so I can analyze it to gain new insights,’ and our collection of scientific data systems and data repositories were not adequately capable of supporting this transition.” A new system built with cutting-edge technology would support on-demand analysis of existing data.

To help us appreciate this, Vijay compared scientific data management with taking a photo on your cell phone. “When was the last time you took a picture or video on your smartphone and then had to name the file, organize the folders in which you were to put that picture or video; and by the way, remember that information forever? That’s pretty much how data management was happening. Individual scientists had to remember where they were saving their information. And in this day and age, we argued, there has to be a better way of managing scientific data.”

Pfizer’s Data Management Goals

Pfizer’s vision was to replace manual, siloed information exchange with an automated, centralized cloud-based system (the SDC). Data generated from lab instruments and manufacturing equipment was to be automatically swept into the SDC, stored, tagged, processed, and formatted for use. This data could then be made available to other systems for reporting, prediction and modelling, and analysis, without requiring tedious transcription to Excel and other analytics tools. The SDC needed to be GMP-compliant and to store and analyze multiple data types, including spectroscopy data (NMR, MS, etc.), chromatography data (LC, UV, etc.), characterization data (PXRD, etc.), and manufacturing data (PAT, etc.), as well as different formats: structured, unstructured, reaction schemes, chemical structures, etc.

The SDC was also designed according to FAIR principles [2]: the data was to be Findable, Accessible, Interoperable, and Reusable.
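To make the FAIR idea concrete, here is a minimal sketch of the kind of metadata record that could be attached to a file as it is swept from an instrument. It is an illustration only, not a description of how the SDC or Spectrus is implemented: the class name, field names, and the find helper are all hypothetical.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json
import uuid


@dataclass
class AnalyticalRecord:
    """Hypothetical metadata captured automatically at ingest (illustrative only)."""
    technique: str        # e.g. "NMR", "LC", "PXRD"
    instrument_id: str    # which instrument produced the data
    site: str             # where the data was acquired
    raw_file_uri: str     # location of the original raw file (Accessible)
    compound_id: str      # compound identifier used for lookup (Findable)
    eln_record: str       # link to the experimental context (Reusable)
    vendor_format: str    # original vendor format, kept for traceability
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    acquired_utc: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Vendor-neutral serialization that other systems can read (Interoperable)
        return json.dumps(asdict(self), sort_keys=True)


def find(records, **criteria):
    """Return the records whose fields equal every keyword criterion."""
    return [
        r for r in records
        if all(getattr(r, key) == value for key, value in criteria.items())
    ]
```

Because the context travels with the data, a colleague could later call something like find(records, technique="NMR", site="Groton") without needing to know which instrument or folder the file originally came from.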

Why Pfizer Chose Spectrus for Analytical & Chemical Data Handling

“We are analytical chemists in Pharma [working] on small molecules… everything is based on the structure.”
– Pankaj Aggarwal

The Spectrus Platform integrates data from NMR, MS, LC, GC, UV, and more
Pfizer scientists had existing site licenses for ACD/Labs Spectrus software so they were familiar with the ACD/Labs platform
Scientists can search Spectrus databases using chemically intelligent parameters: chemical structure, substructure, and spectral elements (retention time, peak, peak area, etc.)
Parameters central to scientists’ workflows and data searches could be integrated into the solution; at Pfizer, it was important to include ELN record numbers and PF numbers
Applications on the Spectrus Platform can use routine spectral data to speed up workflows and provide insights to support faster decisions; for example, training the NMR prediction database can support faster (or automated) structure verification.

“Assigned NMR data automatically feeds into ACD/Labs’ NMR training database for chemical shift prediction and can be tweaked by subject matter experts (SMEs). By introducing Pfizer specific compounds with associated chemical shifts into the database we’re honing the predictions.”
– David Foley

Benefits of Spectrus-enabled scientific data management at Pfizer

Centralized, reusable analytical data

Data is automatically swept to the cloud from analytical instruments—“the scientist doesn’t have to deal with all the data conversions or integrations…and the complexity of the scientist needing to know the source of the data is removed. We have been able to get all the data into a central location making it available to the general user who can use it for numerous different applications,” said Vijay.

“I don’t need to go into my ELN, to call my friend and ask for those method conditions or search for the method conditions from LIMS. I have it readily available and I can go and try to repeat the method in the lab,” added Pankaj.

Facile search across data collected over multiple sites, groups, instruments, and data types

“We have a fully searchable, gold standard database of NMR and mass spectra. It can be searched by PF number (Pfizer compound number), structure, and substructure. Scientists can do spectral matches. They can input NMR or mass spec and quickly answer the question ‘is there a spectrum or compound in the database that matches closely for this?’ Time spent searching for information is reduced way down,” explained David.
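For readers unfamiliar with spectral matching, the sketch below shows one common, generic approach: binning peak lists and scoring them by cosine similarity. It is not the algorithm used by Spectrus or by Pfizer, which the webinar did not describe. The function names, the bin width, and the (position, intensity) peak-list format are all assumptions made for illustration.

```python
import math


def bin_peaks(peaks, bin_width=1.0):
    """Sum intensities of (position, intensity) peaks into fixed-width bins."""
    binned = {}
    for position, intensity in peaks:
        key = round(position / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spectrum_a, spectrum_b, bin_width=1.0):
    """Score two peak lists from 0 (no shared bins) to 1 (same relative pattern)."""
    a = bin_peaks(spectrum_a, bin_width)
    b = bin_peaks(spectrum_b, bin_width)
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(query, library, bin_width=1.0):
    """Return (compound_id, score) for the library entry most similar to the query."""
    scored = (
        (compound_id, cosine_similarity(query, peaks, bin_width))
        for compound_id, peaks in library.items()
    )
    return max(scored, key=lambda pair: pair[1])
```

Here library is assumed to be a non-empty dictionary mapping a compound identifier to its peak list. A production system would also need to handle peak alignment, noise, and technique-specific scoring, all of which this sketch ignores.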

Increased efficiency through elimination of duplicated effort and repeat experiments

Pankaj described how access to a centralized library of methods data is resulting in significant time savings. “If I’m trying to develop a chiral method I can search the database to help answer the question ‘does a chiral method already exist?’ If I’m in development, a method might have already been developed in the research environment. [Now] I can just draw that structure and search the database to see if there is a pre-existing method that I can use or start with instead of developing a new chiral method from scratch.”

Insights are being drawn from analytical data

“The type of trend analysis I’m able to do with the Spectrus library would take a significant amount of time before. I would have to go across multiple systems like ELN, LIMS, Empower, assemble all that data into an Excel file and plot it simply to learn that chromatographic peak tailing was only seen from a particular LC instrument.”
– Pankaj Aggarwal
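The kind of trend analysis Pankaj describes becomes a few lines of code once run results and their metadata sit in one place. The sketch below groups a peak tailing measurement by instrument and flags outliers. The record layout, the field names, and the 1.5 threshold are hypothetical and chosen only for illustration; they are not taken from Pfizer's system.

```python
from collections import defaultdict
from statistics import mean


def mean_tailing_by_instrument(runs):
    """Average the tailing factor of chromatographic runs for each instrument."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run["instrument"]].append(run["tailing_factor"])
    return {instrument: mean(values) for instrument, values in grouped.items()}


def flag_instruments(runs, limit=1.5):
    """List instruments whose mean tailing factor exceeds an assumed limit."""
    averages = mean_tailing_by_instrument(runs)
    return sorted(inst for inst, avg in averages.items() if avg > limit)
```

Given a list of dictionaries such as {"instrument": "LC-02", "tailing_factor": 1.9}, flag_instruments returns the names of instruments that warrant a closer look, which is the question that previously required assembling data from several systems by hand.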

How Pfizer’s Analytical Data Strategy Has Evolved

Since 2020, Pfizer has continued to expand analytical data management while maintaining their core objectives. In a recent presentation, “Digitalizing Scientific Data Management at Pfizer PharmSci”, Bo Du (System Architect, Worldwide Research & Development, Pfizer) provided an update on the continuing digitalization of analytical data across the PharmSci Small Molecule group at Pfizer. He shared how the original effort, initially focused on chromatography data, helped inform a broader scientific data management strategy.

Pfizer are exploring newer ACD/Labs technologies and continue to expand workflow digitalization to improve data management and efficiency through:

A new dashboard for solid-state characterization and drug product analysis
Consolidation of data for pharmaceutical development to track batch genealogy, control impurities, and support stability studies and forced degradation
Expansion of MS data handling for structure characterization and elucidation
Automated processing of high-throughput (HT) LC/MS data, which has reduced manual data manipulation from 4 hours to less than 30 minutes
Exploration of tools to reduce manual data transcription and processing/analysis for HT workflows

Rather than creating new silos, Pfizer are extending a consistent digital framework to support multiple data types and workflows. Reflecting on their progress, Bo noted:

“What began as a focused effort to improve access to chromatography data quickly revealed opportunities to connect analytical data across techniques and teams.”
– Bo Du

While the scope of Pfizer’s data digitalization project has grown, the underlying goal remains unchanged: ensuring analytical data is searchable and reusable to support scientists’ decision-making and data science initiatives.

References

[1] Gartner. (n.d.). Dark Data. In Information Technology Glossary. Retrieved Jan. 12, 2026, from https://www.gartner.com/en/information-technology/glossary/dark-data
[2] M. D. Wilkinson, M. Dumontier, I. Aalbersberg, et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://www.nature.com/articles/sdata201618

About the Author

Sanji Bhal

Director, Marketing & Communications, ACD/Labs

Sanji Bhal is the Director of Marketing & Communications at ACD/Labs. Prior to joining ACD/Labs she was a medicinal chemist at Signalgene Inc., where she pursued her ongoing interest in cancer research, followed by a stint with the CRO NAEJA Pharmaceuticals. Sanji began her career in the U.K., completing her Ph.D. in synthetic organic chemistry at the University of Reading, and a post-doctoral fellowship at Cancer Research UK.