
FAIR Data—What Does it Mean for the Lab Scientist?

May 2, 2024
by Sanji Bhal, Director, Marketing & Communications, ACD/Labs

More than 70% of Scientists Are Not Familiar with FAIR Data

A recent survey[i] found that 73% of scientists are not familiar with the concept of FAIR data. Given the growing interest in FAIR data guidelines among informatics groups, data scientists, and R&D leaders for data management in modern R&D, it is concerning that the concept remains unfamiliar to so many of the scientists who generate the data.

To help deepen the understanding of FAIR data principles, here we summarize the:

  • Goals of FAIR data
  • Benefits to be gained for scientists and their organizations
  • Priority elements for successful implementation of the FAIR data guidelines

You will also find definitions of important terms that are likely familiar (because you hear them at conferences or internal presentations) but are not part of the wet-lab vocabulary.

Goals of FAIR Data

The objectives of FAIR data, first laid out in the paper ‘The FAIR Guiding Principles for scientific data management and stewardship’, are to make scientific data:

Findable

Findable data must have sufficient metadata and unique identifiers, and be indexed or searchable, so it can be located quickly and efficiently. This principle perhaps resonates with scientists in the lab more than anyone. Knowing data exists (that a certain experiment was run) is not enough; for it to be useful, you need to be able to find it. It is common for scientists to re-run an experiment because the data can’t be found. Sometimes repeating the work is quick and easy, or genuinely necessary (a sample degraded); but when it stands between you and moving a project forward, it causes needless frustration and delay.
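To make this concrete, here is a minimal sketch, in Python, of what ‘findable’ can look like in practice: each dataset gets a globally unique identifier and a searchable metadata record. The field names and the in-memory index are illustrative assumptions, not any particular system’s schema.

```python
import uuid

# Illustrative metadata record for one experiment; field names are
# hypothetical. Real systems (ELN, LIMS, data lake) define their own schemas.
record = {
    "id": str(uuid.uuid4()),   # globally unique, persistent identifier
    "technique": "NMR",
    "nucleus": "1H",
    "sample": "compound-0042, batch 3",
    "project": "lead-optimization",
    "acquired": "2024-04-18",
    "operator": "j.doe",
}

# Even a flat index over such records makes data findable: you search by
# metadata instead of hunting through folder trees.
index = {record["id"]: record}

def find(index, **criteria):
    """Return records whose metadata matches every given criterion."""
    return [r for r in index.values()
            if all(r.get(k) == v for k, v in criteria.items())]

hits = find(index, technique="NMR", project="lead-optimization")
```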

Accessible

Accessible data does not need to be open or editable.

Governments that fund research and many academic researchers are demanding open data, claiming it will accelerate innovation and discovery. In a commercial environment, where intellectual property and competition are at stake, accessibility can instead mean that the conditions of access are open and transparent (e.g., the requirements and permissions needed to retrieve the data). Furthermore, editable data in pharma GxP environments requires access control mechanisms and audit trails.

The interim report from the EU Expert Group on FAIR data notes that the principle of ‘as open as possible, as closed as necessary’ should be applied. ‘Accessible’ is an extension of ‘Findable’: it is frustrating to locate data only to discover that it is inaccessible or that you don’t know how to retrieve it.

Interoperable

From the scientist’s perspective, interoperability means that data should not require access to a particular software application. Think of an LC/MS dataset from an instrument that was decommissioned years ago. How will you open that dataset when the software was disposed of along with the instrument? If the legacy LC/MS data were stored in a standardized, normalized format, it could be opened in software you currently have access to.
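To make the LC/MS example concrete, here is a minimal sketch assuming the legacy run had been exported to a documented, vendor-neutral format (plain JSON here, with a hypothetical schema). Any current software stack could then read it without the original vendor application.

```python
import json

# Hypothetical normalized export of a legacy LC/MS run. The schema shown
# is an assumption for illustration; the point is that it is documented
# and vendor-neutral, so no legacy application is needed to read it.
normalized = """
{
  "technique": "LC/MS",
  "source_instrument": "decommissioned system A",
  "chromatogram": {"time_min": [0.1, 0.2, 0.3], "tic": [120, 950, 310]},
  "units": {"time": "minutes", "tic": "counts"}
}
"""

run = json.loads(normalized)

# Any current tool can now work with the data directly, e.g., locate the
# total-ion-chromatogram maximum.
peak_time, _ = max(
    zip(run["chromatogram"]["time_min"], run["chromatogram"]["tic"]),
    key=lambda pair: pair[1],
)
print(f"TIC maximum at {peak_time} {run['units']['time']}")
```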

More generally, interoperability means enabling data for human and machine use. The drive for AI/ML-enabled innovation demands that data be machine-actionable.

Reusable

Reuse is at the heart of the FAIR data guidelines. FAIR data is stored with metadata and context that document its provenance (origins, changes, and details supporting its confidence or validity), so it can be used again with full knowledge of what the dataset contains and how it came to be.
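As a brief illustration, provenance can be as simple as a structured record that travels with the dataset; the field names below are hypothetical.

```python
# Illustrative provenance record stored alongside a dataset. Field names
# are assumptions; the idea is that the origin and every processing step
# are recorded, so a future user can judge the data's validity.
provenance = {
    "origin": {
        "instrument": "LC/MS system 2",
        "method": "gradient B, 15 min",
        "acquired_by": "j.doe",
        "acquired_on": "2024-04-18",
    },
    "history": [
        {"step": "baseline correction", "software": "processing tool v3.1"},
        {"step": "peak integration", "software": "processing tool v3.1"},
    ],
}
```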

Benefits of FAIR Data Management

The ultimate goal of FAIR data management is to optimize data for maximum, long-term value: to extract additional value from data well after it was first generated. Better data management is how the efficiency and productivity gains of data reuse are realized, and the FAIR guidelines give organizations best practices for achieving it, with many potential benefits:

Experiment Reproducibility

Lab scientists can relate to this. Well-characterized data, stored with all the relevant metadata (data that describes data), enables reproducible research. It is why we’re given a ‘formula’ for writing up experiments in undergraduate labs: hypothesis, methods, results, conclusions. Others who review your experiment can see what you did, how you did it, why you did it, your results (with the proof, e.g., the characterized NMR spectrum of your synthetic product), the conclusions you drew, and perhaps your next steps.

Machine-Readiness

There is growing momentum behind leveraging data for AI/ML applications in life sciences R&D. FAIR data enhances the ability of machines to automatically find and use scientific data for application to new problems.

Accelerated Innovation

Faster innovation is the benefit life sciences organizations hope to attain from their investment in FAIR data and data science.

Scientists rely on data for every step of their work, but the human brain is limited by the volume of data it can consume and analyze for patterns. Computers (machines/software), however, can combine vast amounts of data from different studies for new insights by applying AI and machine learning techniques.

By layering organizational knowledge and AI/ML insights into their decision-making, scientists can make leaps in innovation that would be otherwise impossible.

Cost Benefits of FAIR Data

Implementation costs are often considered a barrier to FAIRification, but an EU report on the ‘Cost-benefit analysis for FAIR research data’ estimates the cost of not having FAIR research data at €10.2 billion per year.

FAIR data maximizes value from the data that is already generated and puts it to further use. It also helps avoid duplication of effort, which not only lowers cost but also accelerates timelines across research and development.

Regulatory Benefits of FAIR Data

FAIR data helps R&D organizations prepare regulatory documents efficiently, demonstrate transparency to regulatory agencies, and respond to queries quickly. While lab scientists are usually not directly involved in regulatory submissions, they are responsible for generating the data that is submitted. FAIR data ensures that all relevant data can be collected efficiently by regulatory colleagues without interruptions to lab scientists or requests to re-run experiments—and eliminates the frustrations of not being able to access the necessary information.

Top Considerations for Implementing FAIR Data

The rate at which research data is generated—the volumes and varieties—has increased exponentially in the last several decades. The digital revolution has seen the transfer of paper-based science to glass and then through digitalization to an explosion of data that could be at our fingertips…if only it were better managed. The benefits of this digital revolution can only be realized through better governance and management. This is exactly what the FAIR data guidelines can help accomplish.

In the absence of clear instructions on how these guidelines should be implemented, however, various groups have formed to help interpret and apply FAIR data guidelines (e.g., GO FAIR). Here are some of the most important considerations for successful implementation of the FAIR data guidelines.

Standardization Is Essential for FAIR Data

Interoperability in FAIR data requires data to be standardized so it can be exchanged and used between systems and products. There are many initiatives, historical and current, for analytical data standardization.

Open Formats for Data Standardization

Open formats are non-proprietary: their specifications are published and accessible for use. Created and maintained by standards organizations or industry consortia, they must be formally ratified before they are considered standards. There are a few ratified open data standards for chemical and analytical data (e.g., molfile, SDfile, and InChI for chemical and biochemical data representation; JCAMP-DX for some spectral data; JSON as the current de facto open standard for data serialization and electronic data exchange), but each has its limitations.
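Because an open format’s specification is published, anyone can write a reader for it. As a small illustration, here is a sketch that collects the header records of a JCAMP-DX file, which are stored as ##LABEL=value lines; a production reader would implement the full published specification (multi-line data tables, compressed ordinates, and so on).

```python
def read_jcamp_headers(path):
    """Collect ##LABEL=value header records from a JCAMP-DX file.

    A minimal sketch for illustration only; the published JCAMP-DX
    specification covers much more than this.
    """
    headers = {}
    with open(path, encoding="ascii", errors="replace") as f:
        for line in f:
            line = line.strip()
            if line.startswith("##") and "=" in line:
                label, _, value = line[2:].partition("=")
                headers[label.strip().upper()] = value.strip()
    return headers

# e.g., headers.get("TITLE"), headers.get("DATA TYPE"), headers.get("XUNITS")
```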

Non-ratified data formats (e.g., Allotrope, AnIML) may still be under active development and therefore incomplete. Organizations may adopt them, but they do not have the widespread acceptance and use of ratified standards.

Proprietary Formats Are a Practical Option for Data Standardization

FAIR data does not require data to be stored in an open format to be interoperable, in the same way it doesn’t require data to be open to be accessible.

R&D organizations may choose to standardize on a particular instrument vendor format (though this is challenging given the variety of techniques applied in sample characterization), or a third-party format. ACD/Labs offers standardization of analytical data for >150 instrument vendor formats and all major techniques, and export in machine-readable, AI-ready JSON format.

Data Taxonomy & Ontology

The process of FAIRifying data also raises fundamental questions about how we organize and name data objects. How do we refer to a chromatographic peak? Do we need to differentiate an NMR peak from a peak on a mass spectrum? R&D organizations must define and implement standard terminology for metadata, data taxonomies, and ontologies.

What is data ontology?

An ontology is a common vocabulary for describing data (nouns, verbs, and adjectives), with a defined meaning for each term and defined relationships between terms. Ontologies are written by humans, and they ‘teach’ machines how to read the data.

What is data taxonomy?

A taxonomy organizes standard terms into a hierarchy, providing a consistent structure so that data can be understood quickly.
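To make the distinction concrete, here is a minimal sketch: a taxonomy places standard terms in a hierarchy, while an ontology adds defined relationships between terms. The terms echo the ‘peak’ question above and are purely illustrative; real initiatives use formal frameworks (e.g., OWL ontologies), not Python dictionaries.

```python
# Taxonomy: each term's position in a hierarchy of standard terms.
taxonomy = {
    "nmr_peak": ["data_object", "peak", "nmr_peak"],
    "chromatographic_peak": ["data_object", "peak", "chromatographic_peak"],
    "mass_spectral_peak": ["data_object", "peak", "mass_spectral_peak"],
}

# Ontology-style relationships: defined links between terms that let
# software reason about the data ("an NMR peak is produced by an NMR
# experiment").
relationships = [
    ("nmr_peak", "produced_by", "nmr_experiment"),
    ("chromatographic_peak", "produced_by", "lc_run"),
    ("mass_spectral_peak", "produced_by", "ms_acquisition"),
]

def broader_terms(term):
    """Walk up the taxonomy from a term to its broader ancestors."""
    path = taxonomy.get(term, [])
    return list(reversed(path[:-1]))

print(broader_terms("nmr_peak"))  # ['peak', 'data_object']
```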

FAIR Data Initiatives Require Managerial Commitment

Implementing FAIR data throughout an organization is a commitment for companies big and small. To be successful, it requires new internal processes and policies, and sustained funding (for technology infrastructure, subject matter experts, and internal education/upskilling). It may be rolled into larger digital transformation or digitalization projects.

Many decisions require scientists’ input. For example, those generating and using the data for its primary application may be best suited to help decide ‘what data should be stored?’

FAIRifying legacy data can be beneficial, making volumes of historical data available for interrogation and new insights, but it is also challenging and costly. Adopting FAIR-aligned practices for newly generated data is less resource-intensive. FAIRification may therefore cover only data generated going forward, legacy data as well, or some prioritized subset of both.

If your organization does not plan to FAIRify all its data, scientists may be enlisted to help prioritize datasets. Considerations include the:

  • Value and relevance of the project
  • Uniqueness of the dataset and the resulting competitive advantage to your organization
  • Cost of FAIRifying legacy data (time and resources) versus repeating experiments

Culture Change & Change Management Ease Data FAIRification

Adopting FAIR data principles involves most of the R&D organization and requires a change to the way people work. Data is scattered, and it is generated by people with a primary use in mind yet myriad possible secondary applications. While automation should be part of a FAIRification project (to remove the burden of repetitive processes), any change to an established process or workflow will have teething problems and pushback. CHANGE IS HARD and must therefore be managed.

Organizations need a FAIR data culture and we as scientists need to embrace data as a company asset to be leveraged in its own right, not just “what I need to answer my immediate question”.

Upskilling Scientists Helps with Buy-in

Good science has always relied on data. Digitalization of our processes and workflows means today’s scientist is a data steward. If >70% of scientists are not familiar with the concepts of FAIR data, they are not on the digitalization journey, and the consequences may be slow, painful adoption or outright failure. Improving scientists’ understanding of data and good data practices will make buy-in easier, even when FAIRification impacts their workload.

Read how experts from AstraZeneca and Solvay addressed their data management challenges here.

ACD/Labs has supported customers in improving analytical and chemical data management for decades. We can support you on your FAIR data journey; contact us to learn more.

[i] Aspen Alert newsletter (Aspen Media: www.aspenmedia.inc)

