The Purgatory Database

June 27, 2007

by Ryan Sasaki, NMR Product Manager, ACD/Labs

So the NMRShiftDB, and our comparison to Modgraph and CSEARCH, have caused quite a stir.

Currently we are still awaiting Modgraph’s response on the following two points:

What is the overlap between NMRShiftDB and Modgraph’s NMR prediction databases? Further, with several different database sources how much duplication of data exists across the Modgraph databases?
Once that overlap is removed from the dataset, what is the final deviation produced by NMRPredict?

Modgraph’s website currently makes the following two claims:

“Modgraph NMRPredict shows itself to be the most accurate carbon 13 NMR predictor in an independent evaluation!”

“NMRPredict now contains the world’s largest collection of NMR data – over 424,632 records in total!” (>345,000 are 13C records)

Let me tackle each statement one by one:

On accuracy:

Well I think we all know where we stand. Modgraph claims a greater accuracy than us, but the question remains about their overlap and the validity of their most recent results. Meanwhile, based on our evaluation of the NMRShiftDB we have worked on tweaking our algorithms for version 11 and I will have some new results to report soon. Stay Tuned.

On “the world’s largest collection of NMR data” (>345,000 records of 13C NMR data):

If you go to Modgraph’s website here, you will see how they handle predictions when there is an exact match in the database. You may also notice on the right pane that there are MANY exact matching records in the database. In fact, it turns out that there are 25 matching records in the database for Yohimbine within Modgraph NMRPredict version 3.8.22. However, when you drill down into the database, you will discover that these aren’t all exact matches: some are different stereoisomers of Yohimbine, and others are records from different sources with different experimental conditions.

But it did give us an idea on a quick experiment to run.

I ran a prediction for Benzene in Modgraph NMRPredict Version 3.8.22. The prediction was very accurate. It turns out that there are 56 exact structure matches for benzene in their prediction database. 56! And it appears to me that these matches are all logged as DIFFERENT database records.

Off the top of my head I ran a few more predictions:

38 exact structure matches for toluene

32 exact matches for acetone

39 exact matches for catechin

First of all, let me stress that it is very important to have different hits in the database from different sources. For example, entering different solvent data helps achieve solvent specific prediction. But my question is, are these indeed reported as different records as it appears? If so, this simply bloats the number of records in the database and can be perceived as a very misleading number to quote to the public when talking about NMR prediction. Is quoting the number of records really a useful statistic to measure?

I think that a user would be more interested in the number of unique compounds in a database.

In Modgraph’s defense, these records all seem to be different in some way. They have different references, are in different solvents, temperatures, concentrations, etc. Fair enough. But it should be known that in our software, all this information is added in one record:

15 different sources for the CNMR assignments of benzene. Different solvents, conditions, sources, references, etc. But keep in mind, all this information is stored in and counted as one record!

We currently have over 186,000 records in our CNMR database. But these represent unique structures in the database. If we have more than one source for a chemical structure (and there are many, many instances) it all goes in the same record.
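The difference between the two counting conventions can be shown with a minimal Python sketch. Everything in it is an assumption made for illustration: the entries, the reference labels, and the use of a plain compound name as the structure key are placeholders, not real database content or the actual record format of either product.

```python
from collections import defaultdict

# Hypothetical entries: (structure key, solvent, literature reference).
# Names and references are placeholders, not real database content.
ENTRIES = [
    ("benzene", "CDCl3", "ref-1"),
    ("benzene", "DMSO-d6", "ref-2"),
    ("benzene", "neat", "ref-3"),
    ("toluene", "CDCl3", "ref-4"),
    ("acetone", "CDCl3", "ref-5"),
    ("acetone", "D2O", "ref-6"),
]


def merge_by_structure(entries):
    """Collect every (solvent, reference) pair under its structure key."""
    merged = defaultdict(list)
    for structure, solvent, reference in entries:
        merged[structure].append((solvent, reference))
    return dict(merged)


merged = merge_by_structure(ENTRIES)

# Convention 1: every source entry is counted as its own record.
entries_counted_separately = len(ENTRIES)

# Convention 2: all sources for one structure share a single record.
unique_structures = len(merged)

print(entries_counted_separately, unique_structures)
```

In a real database the structure key would have to be a canonical structure identifier, and the choice of whether stereoisomers share a key would change the count.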

It seems like a good opportunity for me to explain exactly how ACD/Labs builds prediction databases and why we do not import databases from external sources.

Our database team culls the most recent literature, draws chemical structures, and tabulates NMR assignments in what we like to call the purgatory database. Once entered, these data are screened and prioritized in a way that ensures that the chemical shifts that have the most promise to improve your predictions are on the top of the list. These records then sit in limbo until one of our resident NMR spectroscopists is able to quality check, hence the name purgatory.
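As a rough illustration only, the purgatory idea can be pictured as a priority queue: entries wait until a reviewer is free, and the entry with the highest score is checked first. The class name, the method names, and the numeric priority score below are all hypothetical; this is a sketch of the concept, not a description of the actual ACD/Labs system or its scoring.

```python
import heapq


class PurgatoryQueue:
    """Entries wait here until a reviewer pulls the highest-priority one."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities keep insertion order

    def add(self, structure_key, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, self._counter, structure_key))
        self._counter += 1

    def next_for_review(self):
        """Remove and return the entry with the highest priority."""
        _, _, structure_key = heapq.heappop(self._heap)
        return structure_key

    def __len__(self):
        return len(self._heap)


queue = PurgatoryQueue()
queue.add("struct-A", 0.2)  # hypothetical score: expected benefit to predictions
queue.add("struct-B", 0.9)
queue.add("struct-C", 0.5)

first_checked = queue.next_for_review()
still_waiting = len(queue)
print(first_checked, still_waiting)
```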

We are often asked whether or not we would be willing to import data from some other database such as SpecInfo, The Human Metabolome Project, NMRShiftDB, Aldrich, etc.

Our answer is generally no, and not because we question the quality or validity of the data. It is simply because our purgatory database already holds hundreds of thousands of compounds waiting for quality checking. These are data from the most recent literature that are not yet in our database, prioritized in the way that most benefits the structural diversity of our databases.

This is the standard we have put in place and we truly believe it is the best way for us to ultimately improve the prediction in our NMR software. If we didn’t hold quality of science to a high standard within our company, we could double the size of the database tomorrow by importing the entire purgatory database as-is.

In closing, two conclusions from me:

I believe that we have created the best practice to follow for the validation of prediction accuracy on an independent data set. Modgraph has apparently chosen not to follow this practice and has quoted very questionable numbers, in my opinion. If they aren’t going to produce those numbers in a consistent way, this is not a comparison and those numbers are meaningless with respect to ours. Again, we are open to suggestions on a better way or a different method. But those haven’t come either. Database issues aside, it’s the algorithms behind the predictions that are absolutely crucial. This is why the issue of overlap is so important. When ALL overlap is acknowledged and removed, it represents a fair measurement of algorithm performance.
Modgraph currently has the largest collection of records in the world. But what’s important, the number of records, or the number of unique structures? For a database used to generate NMR predictions, the structural diversity of the database is key. While a large database of experimental values will likely improve NMR predictions, it has to be structurally diverse to be useful in the real world.
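To show why overlap matters for the first conclusion, here is a minimal Python sketch that compares the mean absolute deviation on a full test set with the deviation after overlapping structures are removed. The structure keys, the shift values, and both function names are hypothetical; mean absolute deviation is used here only as a simple stand-in for whatever deviation statistic a vendor reports.

```python
def mean_abs_deviation(test_set):
    """Mean absolute difference between predicted and experimental shifts."""
    deviations = [
        abs(predicted - experimental)
        for pairs in test_set.values()
        for predicted, experimental in pairs
    ]
    return sum(deviations) / len(deviations)


def without_overlap(test_set, training_keys):
    """Keep only test structures that are absent from the training database."""
    return {key: pairs for key, pairs in test_set.items() if key not in training_keys}


# Hypothetical data: structure key -> list of (predicted, experimental) ppm pairs.
TEST_SET = {
    "A": [(128.5, 128.5), (21.0, 21.5)],  # also present in the training data
    "B": [(30.0, 32.0)],
    "C": [(77.0, 74.0), (14.0, 15.0)],
}
TRAINING_KEYS = {"A"}

deviation_all = mean_abs_deviation(TEST_SET)
deviation_fair = mean_abs_deviation(without_overlap(TEST_SET, TRAINING_KEYS))
print(deviation_all, deviation_fair)
```

In this made-up example the structure already known to the database is predicted almost perfectly, so leaving it in the test set makes the reported deviation look better than the algorithm alone would achieve.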

EDIT: This conversation has continued in the following entries (in order):

http://acdlabs.typepad.com/my_weblog/2007/07/final-note-on-t.html

3 Replies to “The Purgatory Database”

A few words to the community in order to clarify our definitions:
A “RECORD” within Modgraph’s NMRPredict databases consists of the following information:
• Structure including stereochemical information
• Assigned shift values
• Experimental conditions like solvent, temperature, techniques used for signal assignment
• Couplings, relaxation times, intensities, etc.
• Literature citation
• Remark field
The complete installation of all CNMR databases available within NMRPredict holds 345,308 records. A compound like ‘benzene’ has been measured under various conditions (e.g. Solvent: ‘neat’, CDCL3, DMSO-D6 … to HF:TaF5) – this is counted as a ‘record’. It is correct that these 56 benzenes are counted as 56 records, because every record contributes a new bit of information. For developing a solvent specific prediction scheme we believe it is absolutely necessary to have experimental data of IDENTICAL compounds, but run in different solvents. This very example of Benzene has been mentioned on our website and has been in our help files for many years – http://www.modgraph.co.uk/version3/product_nmr_help_solvents.htm
In order to satisfy Ryan’s interest in the number of DIFFERENT STRUCTURES within our 345,308 records, here it is: 289,543 DIFFERENT STRUCTURES in 345,308 RECORDS
It is interesting to see that Ryan’s blog from June 27th, 2007 comes up with a question which has already been answered by Wolfgang Robien on his web site on June 11th, 2007 – http://nmrpredict.orc.univie.ac.at/csearchlite/How_large_should_it_be.html
On this page the fundamental question about information gain related to the size of the knowledge base has been analyzed very carefully using exactly the C13NMR Predict databases with their 345,308 records.
Identical structure search:
When predicting the chemical shift values for a query structure, an identical structure search is started too. It is our INTENTION to display all stereoisomers of the query – we can easily change that on a customer’s request. From our point of view it is valuable to have information about all stereoisomers in one single table.
Quality of predictions:
Of course customers are really interested in how accurately a prediction program can predict THEIR molecules – not a collection of external data such as NMRShiftDB. We have always encouraged potential customers to try NMRPredict themselves and predict their own molecules and have made NMRPredict available for a free 45 day trial for many years. Go to the Mestrelab Research website – http://www.mestrec.com – and either download Mnova with the included NMRPredict Desktop plug-in or download the NMRPredict client and try your predictions on-line.

Jeff,
Thanks for clearing that information up. As I mentioned in my original posting, compiling multiple sources for one compound is extremely important. I just happen to believe that all of those sources should be compiled and counted as one record when talking about databases that serve as the backbone for NMR prediction. Structural diversity is the key.
I am glad you pointed out this information and provided a more relevant number to share with the public.

Blogs are so informative where we get lots of information on any topic. Nice job keep it up!!