January 11, 2017
The familiar crop of characters (Spock, Kirk, Uhura, McCoy, Sulu, Scotty, and Chekov) is understandably portrayed by different actors in the 2016 movie than at the origins of the STAR TREK™ franchise. Science fiction like this offers a view of the future imagined from past and present experience, and it made me consider where data science is going, whether boldly or otherwise, and also where it came from.
Like modern managers in science-based organizations, Spock valued data for making effective decisions, and that reliance on data will likely continue to increase in the real world for the foreseeable future. Data begins a lineage: it yields information, which provides the knowledge that enables managers to make strategic and tactical data-based decisions for actions that maximize benefits and limit risks. Therefore, data exchange between organizations and data sharing inside organizations are necessary to effectively communicate this “data-information-knowledge” lineage. However, such an approach demands dealing with the data deluge that comes from the overwhelming volume, velocity, variety, and variability of analytical data. In data workflows, which involve not only Big Data but also analyses generating a variety of not-quite-so-big data, two factors contribute greatly to the deluge. The first is the automation and/or parallelization of specific high-throughput analyses on particular instruments. The second is the challenging implementation of the so-called “Internet of Things” (IoT), owing to the tremendous assortment of computer-based data sources and their diversity of parts, performance attributes, and analytical data output formats.
Nestled between the pen and paper of the dwindling past and the hoped-for tricorders of the future is a cornucopia of proprietary data constructs necessary for data source or instrument hardware innovation and for the efficient acquisition and storage of bytes to petabytes of analytical data. Ongoing heterogeneity of analytical data formats is thus a natural hallmark of technology advancement. Naturally, though, such a plethora of data formats begets the desire for standardization. Standardization, however, is not exactly the same as having a human-parsable, purely open generic format such as ASCII text or XML.
Much like the STAR TREK franchise, analytical data format standardization has gone through several iterations, driven by initiatives from various groups. The following is a personal and non-exhaustive recollection of some of those efforts:
- One of the earliest was by the Galactic Company (now part of Thermo Scientific), starting back in 1986, with a binary format for a variety of spectroscopic data (SPC).
- Also in the 1980s, the ASTM E01.25 subcommittee for Laboratory Analytical Data Interchange protocols and Information Management was conducting an effort that led to the ANDI data standard (NetCDF).
- Circa 1995, the Joint Committee on Atomic and Molecular Physical Data format (JCAMP-DX) was established with the involvement of the International Union of Pure and Applied Chemistry (IUPAC).
- In 2003, IUPAC was looking toward markup languages for data formats. Not surprisingly, in the same year the ASTM E13.15 subcommittee, with the involvement of a range of stakeholders, spearheaded an initiative to make a new XML-based standard (AnIML) for analytical data. During its emergence, the mass spectrometry community, among others, was also actively trying to create other exchangeable analytical data formats (mzXML, mzData), and eventually the de facto standard mzML evolved.
- More recently, a subset of organizations from the pharmaceutical industry formed the Allotrope Foundation with the aim of establishing an Analytical Data Taxonomy (ADT) and creating their own Allotrope Data Format (ADF), contracting a third-party software company to build a supporting software framework. As with earlier efforts, prior experience is being taken into account, but building the ADT seems the newer, interesting part of the exercise.
Meanwhile, for the past twenty years, ACD/Labs has developed a software platform under the Spectrus brand with technologies that enable management of analytical data coming from an extremely wide range of instrument vendor detectors in hundreds of proprietary and open data formats. The platform handles the analytical data deluge arriving in many data formats from a variety of data sources across diverse data workflows, with algorithms that perform and facilitate data processing, data interpretation, data aggregation, homogenized data storage, data querying, and data mining for data presentation and data science.
A single, universal, efficient, open analytical data format may look like the ultimate ideal for the data future. I seem to recall, though, that the USS Enterprise has seen more than one version over the years, and I won’t be surprised to journey on toward the data future in an ecosystem of multiple useful analytical data standards, including Spectrus, for years to come.
STAR TREK and related marks are trademarks of CBS Studios Inc.