Episode 1: Mass Spec 1100101—Data and the Mass Spectrometrist
Mass spectrometry has changed a lot over the last 20 years, and it’ll change even more over the next 20. Some of that change has been driven by hardware advances, but we’re also capable of collecting, analyzing, organizing, and storing a lot more data. How have the last 20 years looked, and what are the most promising new advances? What’s the 1100101 on data in mass spectrometry? Are machine learning and AI going to make a difference in MS?
Join us in Episode 1.
Episode 1: Mass Spec 1100101—Data and the Mass Spectrometrist Transcript
How has the field of mass spec changed? And where is it going? As mass spectrometrists collect more and more data, what can be done with it? Welcome to The Analytical Wavelength, a podcast about chemical knowledge and data in agrochemical, pharmaceutical, and related industries. I’m Charis Lam, a Marketing Communications Specialist at ACD/Labs, and I’ll be your host for this episode, where we’ll be talking about mass spec: how MS technologies and capabilities have evolved, and how they’ll continue evolving as our data-handling capabilities mature.
Our first guest is Richard Lee. Richard is the Director of Core Technology and Capabilities at ACD/Labs, where he has worked for over 8 years. He received his PhD in chemistry from McMaster University in 2009, and has been working on mass spec and scientific instruments ever since. Richard, how far back does your experience with MS stretch?
Well, I guess my first exposure to MS was back in my undergrad days. I was actually fortunate enough to be hired as a technician at McMaster University in their regional MS facility. There were a number of instruments: high-res, low-res, LC-, GC-MS. And we ran industrial samples, but also samples from across campus. So that was my first exposure, and that got me interested. That led me to pursue a grad degree, a Master's in fundamental gas-phase ion chemistry. And then I moved on to something a bit more applied in bio MS, working with a CE-MS platform on metabolomic studies. After that, I moved into pharmaceutical development, where I was responsible for all the analytical instruments and analysis for both discovery and development. And then I moved here to ACD/Labs.
And what has changed in the time that you’ve been working with MS?
I would say it's really been the commoditization of MS instruments. It used to be that there was a central lab, because it was an expert tool, right? You would submit samples, and you would get the analysis results back, and the analysis would be done by the expert. So you wouldn't be doing any of the interpretation. But over the course of time, I've noticed that a lot of labs nowadays have their own instruments, primarily due to the technology: it's far easier to maintain the instrument and to run the samples. If you look at an academic environment, a lot of your synthetic labs have their own MS instrument, because it is far easier to maintain. And it's not so much an expert tool anymore. Although, that said, a lot of the instruments that are considered high-res instruments are still run by experts, because they do still need a certain level of expertise, care, and knowledge of the hardware and software for interpretation. That's what I've noticed.
It also happens that there are a lot more open-access systems available in industry. So the chemist, who is not an expert, will go up to an open-access system with their samples prepared, and simply click inject or run the sequence. They'll get their results, and they'll do their own interpretation. The prevalence of open-access systems seems to have driven a lot of the commoditization of LC/MS instruments.
And what changes do you see happening going forwards?
In MS in general, from reading the literature, looking at the trends, talking to scientists, and going to meetings, what I've noticed is that in the past 10 years, even 15 or 20 years, there's been a lot of foundational development on miniaturizing the system, on making the MS instrument more portable. They used to be these huge pieces of equipment sitting on the floor because they were just so massive. But now, because of the advancement of the technology and the miniaturization of the electronics, the portability has become pretty impressive. You see that even now at airports: those swabs they take as you go through security are a form of MS instrument.
But in terms of actual practical application in the next 10 to 15 years, I would say imaging. That would be pretty interesting. There are companies out there doing imaging analysis, imaging MS. But on a more practical level, imagine having it be portable and able to analyze real-life samples, let's say, for example, in surgery, in an OR. Say a surgeon is excising a tumor: they need to make sure they excise the entire tumor from the surrounding tissue. So what they'll do is take a slice of tissue and send it off to be stained. Wouldn't it be pretty neat if they had a portable MS instrument and could analyze the tissue at hand in real time, to see if there are any markers of cancer cells?
There is technology that already exists that can sample directly from surfaces, called DART (Direct Analysis in Real Time). It would be pretty interesting to see how applicable that would be in a real-life surgical environment in medicine. So that's a different take on where MS could go in the next 5, 10, 15 years.
That sounds really cool. And I remember seeing a conference presentation where they talked about doing that mass spec right in the OR, as the surgeon is cutting, which cut down the time needed to send a sample to pathology.
So we see that a lot of changes have happened and will continue happening in MS, with a lot of different techniques and types of instruments, and a lot of different types of data that are being collected. So how are companies handling all the data that they’re collecting? To help us learn, we have our second guest, Graham McGibbon. Graham is Director of Strategic Partnerships at ACD/Labs. He also did his PhD at McMaster University, and since then he’s had years of experience in both academia and pharma. Welcome, Graham. We see that there are many, many data formats available for mass spec. What’s the story behind them?
Well, there are a lot of companies that have manufactured mass spectrometry instruments over the years. And each one of them, in order to innovate their products, used proprietary custom encoding that allowed them to transform their data most efficiently: collect it, store it, and be able to process it and present it to users, so that they could achieve their specific goals in terms of whatever kind of analyses they were doing. These instruments have different types of ionization, different types of mass analyzers. And so, along with the different methods that can be used to acquire either scans or focus channels, that meant that the different vendors have come up with several different ways of packaging up their data in order to be able to use it very effectively.
Mass spectrometry is used for a variety of purposes: it can be used for qualitative analysis and for quantitative analysis, and often also to examine the reactivity of molecules. The simplest thing that might come to mind for people is fragmentation, so unimolecular reactions, but you can also react molecules chemically in the gas phase and then study them using mass spectrometry. So it's a little microcosm of chemistry that goes on there. And pursuing those analytical properties of merit, the best quantitative information, the most sensitivity, and the most thorough qualitative investigation (what are these chemicals? how much is there in a sample?), means that different kinds of scans are used, different kinds of data acquisition, and different resolutions and sensitivities of instruments. All of those contribute to different types of data, in terms of the content of information and the way they're acquired in the instrument. They're all packaged up, though, for use with computers these days. And that requires acquiring those signals off the instruments and putting them into efficient digital, computer-compatible formats, binary zeros and ones, but the way one might encode that information varies from vendor to vendor.
So it sounds like a lot of the diversity in data formats comes from the diversity of ways in which people use mass spec: the experiments themselves and the goals they're trying to achieve.
Exactly. And because we have different companies contributing to instrumentation in the space, everybody's slight or large differences will also contribute to variations in their formats. As for what you can transform into, there are a couple of open formats, because data exchange between systems, or post-acquisition processing using different software, is of interest to people. So there are also some open formats that have been developed over the years for mass spectrometry data. All of them, though, do require some efficiency in terms of storing data. If you just stored it in human-readable formats, you would probably end up with impractically large files in many cases, because nowadays some of the large data files are gigabytes in size for an individual experiment. And you can extrapolate from there if you have tens, or hundreds, of instruments in an organization acquiring data on a daily basis.
So does that mean that in addition to different formats for acquiring the data in the first place, because of the experiments or the instrument, we also have to think about different formats for longer-term data storage?
Yeah, there's a difference in how you might package up data for use in immediate interpretations. There are basically two kinds of analyses: confirmatory data analyses and exploratory data analyses. The confirmatory ones tend to be what a scientist would be doing on a day-to-day basis. What's in my sample? Have I seen the chemical I'm expecting in the data? Is there a mass in it that corresponds to my compound? And in other cases, can I measure how much of it there is? But the exploratory data analyses are part of the future, and we can talk about that a bit more as we go through our conversation.
But for these confirmatory analyses, basically, a mass spectrometer collects a set of masses, or mass-to-charge ratios, for the ions that are detected. The individual masses themselves might allow a person, with an accurate-mass, high-resolution instrument, to deduce a molecular formula, for instance, and sets of masses can be further evidence of which isotopes are present. That was one of the fundamental uses of mass spectrometry from its very start. And the structure of the molecules can be further inferred or confirmed using fragmentation approaches. So the nature of how data is stored and used depends on, as you mentioned, what kinds of information people want to get. Certain instruments may be better at quantitation, and other instruments may be better at the qualitative question of what is in my sample. Some instruments try to strike a balance.
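As an illustration of the confirmatory check described here, matching a measured accurate mass against a candidate molecular formula, a minimal Python sketch might look like the following. The 5 ppm tolerance and the caffeine example are illustrative assumptions, not details from the conversation:

```python
# Minimal sketch: check whether a measured accurate mass is consistent
# with a candidate molecular formula, within a ppm tolerance.
# Monoisotopic masses are standard values; the 5 ppm default is illustrative.

MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def mono_mass(formula):
    """Monoisotopic mass of a formula given as {element: count}."""
    return sum(MONO[el] * n for el, n in formula.items())

def matches(measured, formula, ppm=5.0):
    """True if the measured mass is within `ppm` of the calculated mass."""
    calc = mono_mass(formula)
    return abs(measured - calc) / calc * 1e6 <= ppm

# Caffeine, C8H10N4O2, monoisotopic mass ~194.0804 Da
caffeine = {"C": 8, "H": 10, "N": 4, "O": 2}
print(matches(194.0804, caffeine))  # → True
```

In practice the harder problem is the inverse one, enumerating which formulas could fit a measured mass, but the forward check above is the everyday confirmatory case.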
Those are all for the scientists to use on a daily basis, so they need to be able to interact with the data very quickly and efficiently. But for longer-term data preservation, as you would require, for instance, in the pharmaceutical industry, you need to be able to package up and compress the data so that it doesn't take up too much space when you store it. And you need enough information to be able to find it again effectively. There's a tradeoff there between very high compression for effective long-term storage, and a less compressed, more accessible format that can be transformed or manipulated, processed maybe, to reduce the noise in a particular data set and find the signals. Transform, in this sense, doesn't mean changing the fundamental encoding, but using this data in an analysis to get at: what are the signals, and what are they telling me about the chemistry of my sample?
So whenever we talk about data, especially when we start talking about big data, large amounts of data, data for exploratory analyses, there’s an elephant in the room that’s popped up over the last few years. And that’s AI and machine learning. So what do those technologies have in promise, or not have in promise, for the mass spec landscape? Let’s see what both Richard and Graham have to say.
It depends on your take on how it's going to be applied to MS, or how the MS data is going to be used. In the sense of how data is going to be leveraged for machine learning, it depends on the application you want to use machine learning for. Let's say, for example, you have LC/MS data that is going to be used to evaluate a chemical reaction. You're going to use the data to identify whether or not the compound has been formed. So if you're doing a screening-type experiment, like high-throughput chemistry, you want to use that data to see whether your reaction has been successful, and use that information in a machine-learning environment to predict or direct, based on which reactions are successful, your next steps of chemistry, or to narrow down the scope of the chemical structures you want to synthesize. So I can see how you would leverage LC/MS data to that extent, but I think that will be very challenging right now. Not that it's impossible, but it's very, very challenging, because of how that data is going to be put into the machine-learning framework. Right now, there is no consistent format for it. And you need good data to feed it.
A very important thing to realize, alongside the confirmatory versus exploratory aspect we were talking about, is that the majority of a mass spectrometry data set probably consists of noise, or certainly of not very useful, low-level signals. For a sensible use of big data, you wouldn't want every single data point from an original mass spectrometry data set to be stored and accessible for further interpretation. In some cases, maybe, but in the broad sense of the data that's acquired, that would be very wasteful: very large data sets stored where only very, very small amounts are actually useful to leverage in the future.
AI and machine-learning approaches are basically statistical tools at heart, and they're used for various reasons. One would be to predict better outcomes. Another could be to find more efficiencies. And another could be to determine what something would look like, or behave like, from other information you have, without having made a compound. So perhaps, at some point, we could predict the fragmentation of molecules we've never seen before, with sufficient accuracy and precision. And that's important, because ACD/Labs has a long history of doing fragmentation prediction.
But getting a spectrum accurate in terms of its intensities, compared to an experimental spectrum, means taking into account the timeframes of the analyses taking place in the various mass spectrometers, and that's a more complex situation at this point in time. So that's certainly a visionary goal for mass spectrometry and mass spectrometry data. But it still doesn't mean that you need the entirety of the data set.
And the third thing is a standardized vocabulary of metadata. There are some initiatives there that are coming along and have been good. Every vendor has used its own terminology internally, because they had to. Take ionization mode, or type of ionization: for instance, electron ionization or electrospray ionization. There has to be a way to store that with a set of data or a spectrum, so that one knows what the ionization was, because you can't necessarily tell just from looking at the mass-to-charge values what the ionization mode was. In some circumstances, you might have a guess from looking at a spectrum with a lot of experience, but a priori, it's just a list of information, usually abundances and masses, or something like that. And if you want to extract information, like I said, perhaps a molecular formula, and if you have chromatography with mass spectrometry, which is very common, then you might want to look at the spectra that correspond to a set of peaks. So the standard vocabulary has to let you talk about peaks, and it has to let you talk about masses or mass-to-charge values. Perhaps for a protein you want to find out, from a multiply charged set of peaks in a spectrum, what the deconvoluted mass of the neutral protein would be. All these things are important to capture in standard metadata, so you know what you're talking about: what are the axes of my data, for instance intensity, mass-to-charge, and time? And then what are those other things? Do the peaks have heights, and do they have areas? Is it in intensity units, and what kind of mass spectrometer was used to acquire it? What was my ionization? What was the polarity: was it positive-ion data or negative-ion data? These are all important things. In some cases, the terms are standard. For instance, polarity is standard across the industry, because there are only two polarities, and it's a pretty standard term.
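The protein example mentioned above, deconvoluting a multiply charged set of peaks to get the neutral mass, follows from simple arithmetic on adjacent charge states. Here is a minimal sketch, assuming protonated ions and simulated peak values for a hypothetical 10,000 Da protein:

```python
# Minimal sketch: deconvolute the neutral mass of a protein from two
# adjacent multiply charged peaks (charges z and z+1, protonated ions).
# The peak values below are simulated for a hypothetical 10,000 Da protein.

PROTON = 1.00728  # proton mass, Da

def deconvolute(mz_high, mz_low):
    """Given adjacent peaks with mz_high (charge z) > mz_low (charge z+1),
    return (z, neutral_mass).

    Since m/z = (M + z * PROTON) / z, the charge z falls out of the
    ratio (mz_low - PROTON) / (mz_high - mz_low)."""
    z = round((mz_low - PROTON) / (mz_high - mz_low))
    neutral_mass = z * (mz_high - PROTON)
    return z, neutral_mass

# Simulated peaks at z = 10 and z = 11 for a 10,000 Da protein
z, mass = deconvolute(1001.00728, 910.09819)
print(z, round(mass, 2))  # → 10 10000.0
```

Real deconvolution software averages over the whole charge-state envelope rather than a single peak pair, but the underlying relationship is the one used here.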
But in many cases, as you can see, fields like instrument type, or even a comment field, a user ID, or a username, haven't always had the same identifiers, the same standard nomenclature, if you will.
And that's important if we're going to work not just in one piece of software. ACD/Labs, for example, takes vendors' encoded formats and re-represents them in the ACD/Labs environment, so that we can process, transform, and extract information from the data. Then, if ACD/Labs software is going to pull in other data from, say, other techniques like NMR or infrared spectroscopy to complete an analysis, we need to know the terms to look for, to understand what the other data is and what the other sample information is. Or the data itself, or more likely results, might be taken from ACD/Labs and presented to some other software for some reason; it could be a larger enterprise system or a lab notebook.
And so this goal of standardizing terminology doesn't mean that every piece of software has to operate only using the standard term, but there needs to be a translation ability, so that you can share data more effectively. And, as we were saying about those machine-learning and AI tools, they should have that standard metadata, no matter what the original data was, so that you can compare appropriately where you need to. That's one of the big drivers of this metadata standardization: the goals of machine learning.
But as I said, I really think the biggest value that could come out of these machine-learning and AI approaches is from the results, more than purely the raw data. I think a lot of people would like to use their instruments more efficiently. But while the statistical tools could certainly tell them about instrument usage, if it's standardized, that's not an unexpected thing. People have long known that they would like to use their instruments efficiently. They'd like to know if an instrument is not performing at peak efficiency, and they've tried to run these kinds of calibration experiments and monitor them. If you're using chromatography columns, you know that they change over time with use. So those are the kinds of things that, I wouldn't say are confirmatory necessarily, they're exploratory, but they're about known uses and known efficiencies.
The hype you mentioned around AI and machine learning is that they'll be really useful for showing us something we didn't expect. And they may be, but the data needs to be organized. While some noise is fine, you do need real signals in there in order to extract insights of value. So I think there's a balance between acquiring good data now, which is very important (the scientists know why they do that, and the organization needs it for value), and trying to figure out what the ideal experiment would be to facilitate some kind of AI or machine-learning approach in the future.
Both Richard and Graham talked about organizing data, both as a potential prerequisite for AI and ML in the future, but also as something that would be good to do right now regardless. So Richard, what makes good data good?
I would say that good data is data that is complete. Whatever results you have, whatever metadata you want to input into the machine-learning interface, has to be a complete, full set of data. It has to be well formatted, and the data has to be consistently formatted. Right now, there is no single standard for machine learning in terms of what the metadata fields are or how the data is tagged. So it's going to depend on what format the organization uses. And for the example of evaluating whether a chemical reaction is successful, and using that information to predict future reactions: the reactions don't necessarily have to be successful to be useful data. Failed reactions are just as useful, because then you know what areas to avoid.
And what parting thoughts do you have for our listeners that they should take away about the future of mass spec and mass spec data?
Scientists' value is in reviewing the data: making sure they have good-quality data and that they're doing good experiments, like we talked about. And there's just going to be more and more data acquired, which is going to put pressure on them to have effective tools and to be able to review data very efficiently. A consequence of that is that software, like ACD/Labs software, needs to continue to evolve to meet those needs. And I think that's an exciting time, both for scientists and for ACD/Labs.
Thanks so much, Graham. And Richard, what are your thoughts?
The world of MS is going to be pretty exciting. There are undoubtedly going to be advancements in the hardware: better, higher-resolution mass analyzers, different ionization techniques. But I think the software is going to be just as important, and you can see now that the software is stepping up alongside the hardware. So you'll notice a lot more emphasis on the software side, in order to leverage the actual analytical data coming off the instrument.
Thanks to both Richard and Graham, and thank you to all of you, our listeners, for joining us today. I hope you’ve taken away something that will be of use as you think about how mass spec and especially mass spec data is going to look in your organizations and in your work over the next 5, 10, 15 years.
Enjoying the show?
Subscribe to the podcast using your favourite service.