June 27, 2007
So the NMRShiftDB evaluation and our comparison to Modgraph and CSEARCH have caused quite a stir.
Currently we are still awaiting Modgraph’s response on the following two points:
- What is the overlap between NMRShiftDB and Modgraph’s NMR prediction databases? Further, with several different database sources how much duplication of data exists across the Modgraph databases?
- Once that overlap is removed from the dataset, what is the final deviation produced by NMRPredict?
Modgraph’s website currently makes the following two claims:
“Modgraph NMRPredict shows itself to be the most accurate carbon 13 NMR predictor in an independent evaluation!”
“NMRPredict now contains the world’s largest collection of NMR data – over 424,632 records in total!” (>345,000 are 13C records)
Let me tackle each statement one by one:
On “the most accurate carbon 13 NMR predictor in an independent evaluation”:
Well, I think we all know where we stand. Modgraph claims greater accuracy than ours, but the questions about their overlap and the validity of their most recent results remain open. Meanwhile, based on our evaluation of the NMRShiftDB, we have been tweaking our algorithms for version 11, and I will have some new results to report soon. Stay tuned.
On “the world’s largest collection of NMR data” (>345,000 records of 13C NMR data):
If you go to Modgraph’s website here, you will see how they handle predictions when there is an exact match in the database. You may also notice on the right pane that there are MANY exact matching records in the database. In fact, it turns out that there are 25 matching records in the database for Yohimbine within Modgraph NMRPredict version 3.8.22. However, when you drill down into the database, you will discover that these aren’t exact matches at all, just different stereoisomers of Yohimbine, along with records from different sources with different experimental conditions.
But it did give us an idea for a quick experiment to run.
I ran a prediction for Benzene in Modgraph NMRPredict Version 3.8.22. The prediction was very accurate. It turns out that there are 56 exact structure matches for benzene in their prediction database. 56! And it appears to me that these matches are all logged as DIFFERENT database records.
Off the top of my head I ran a few more predictions:
38 exact structure matches for toluene
32 exact matches for acetone
39 exact matches for catechin
First of all, let me stress that it is very important to have different hits in the database from different sources. For example, entering different solvent data helps achieve solvent specific prediction. But my question is, are these indeed reported as different records as it appears? If so, this simply bloats the number of records in the database and can be perceived as a very misleading number to quote to the public when talking about NMR prediction. Is quoting the number of records really a useful statistic to measure?
I think that a user would be more interested in the number of unique compounds in a database.
In Modgraph’s defense, these records all seem to be different in some way. They have different references and are in different solvents, temperatures, concentrations, etc. Fair enough. But it should be known that in our software, all this information is added to one record.
Our benzene record, for example, holds 15 different sources for the CNMR assignments of benzene. Different solvents, conditions, sources, references, etc. But keep in mind, all this information is stored in and counted as one record!
We currently have over 186,000 records in our CNMR database. But these represent unique structures in the database. If we have more than one source for a chemical structure (and there are many, many instances) it all goes in the same record.
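To make the difference between the two ways of counting concrete, here is a minimal sketch in Python. It is not how either product stores its data. The structure keys, the field names and the `group_by_structure` helper are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical flat entries: one row per (structure, solvent, source).
# The string keys stand in for whatever unique structure identifier a real
# database would use; they are assumptions, not a real schema.
flat_entries = [
    ("benzene", "CDCl3", "ref A"),
    ("benzene", "DMSO-d6", "ref B"),
    ("benzene", "CDCl3", "ref C"),
    ("toluene", "CDCl3", "ref D"),
    ("acetone", "D2O", "ref E"),
]


def group_by_structure(entries):
    """Collect every source for the same structure under a single record."""
    grouped = defaultdict(list)
    for structure_key, solvent, source in entries:
        grouped[structure_key].append({"solvent": solvent, "source": source})
    return dict(grouped)


grouped = group_by_structure(flat_entries)

# Counting rows and counting unique structures give different totals.
entry_count = len(flat_entries)
unique_structure_count = len(grouped)

print(f"entries counted one per source: {entry_count}")
print(f"records counted one per structure: {unique_structure_count}")
```

In this toy data the same five sources add up to five "records" under one convention and three under the other, which is why the two headline numbers cannot be compared directly.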
It seems like a good opportunity for me to explain exactly how ACD/Labs builds prediction databases and why we do not import databases from external sources.
Our database team culls the most recent literature, draws chemical structures, and tabulates NMR assignments in what we like to call the purgatory database. Once entered, these data are screened and prioritized so that the chemical shifts with the most promise to improve your predictions are at the top of the list. These records then sit in limbo until one of our resident NMR spectroscopists is able to quality check them, hence the name purgatory.
We are often asked whether or not we would be willing to import data from some other database such as SpecInfo, The Human Metabolome Project, NMRShiftDB, Aldrich, etc.
Our answer is generally no. Our response is not because we question the quality or validity of the data. It is simply because we have a purgatory database that literally consists of hundreds of thousands of compounds that are waiting for quality checking. This is data coming from the most recent literature that we currently do not have in our database that is prioritized in a way that is most beneficial to the structural diversity of our databases.
This is the standard we have put in place and we truly believe it is the best way for us to ultimately improve the prediction in our NMR software. If we didn’t hold quality of science at a high standard within our company, we could double the size of the database tomorrow by importing the entire purgatory database as-is.
In closing, two conclusions from me:
- I believe that we have created the best practice to follow for the validation of prediction accuracy on an independent data set. Modgraph has apparently chosen not to follow this practice and has quoted very questionable numbers, in my opinion. If they aren’t going to produce those numbers in a consistent way, this is not a comparison and those numbers are meaningless with respect to ours. Again, we are open to suggestions on a better way or a different method. But those haven’t come either. Database issues aside, it’s the algorithms behind the predictions that are absolutely crucial. This is why the issue of overlap is so important. Only when ALL overlap is acknowledged and removed does the result represent a fair measurement of algorithm performance.
- Modgraph currently has the largest collection of records in the world. But which is more important, the number of records or the number of unique structures? For a database used to generate NMR predictions, the structural diversity of the database is key. While a large database of experimental values will likely improve NMR predictions, it has to be structurally diverse to be useful in the real world.
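To illustrate why overlap matters for the first conclusion, here is a rough Python sketch of an evaluation with and without overlapping structures. It is only a sketch under stated assumptions: the shift values are made up, the structure keys are placeholders, and mean absolute error is my choice of deviation metric for the example, not necessarily the statistic used in any of the published comparisons.

```python
# Hypothetical test set: (structure_key, experimental_shift_ppm, predicted_shift_ppm).
# All values are invented for illustration only.
test_set = [
    ("A", 128.0, 128.0),  # structure also present in the training database
    ("B", 30.0, 32.0),
    ("C", 77.0, 81.0),
]

# Hypothetical set of structure keys already in the prediction database.
training_keys = {"A"}


def remove_overlap(rows, known_keys):
    """Keep only test rows whose structure is absent from the training data."""
    return [row for row in rows if row[0] not in known_keys]


def mean_absolute_error(rows):
    """Average absolute difference between experimental and predicted shifts."""
    if not rows:
        raise ValueError("no rows left to evaluate")
    return sum(abs(exp - pred) for _, exp, pred in rows) / len(rows)


mae_with_overlap = mean_absolute_error(test_set)
mae_without_overlap = mean_absolute_error(remove_overlap(test_set, training_keys))

print(f"deviation with overlap:    {mae_with_overlap:.2f} ppm")
print(f"deviation without overlap: {mae_without_overlap:.2f} ppm")
```

In this toy example the structure that already sits in the database is predicted perfectly, so leaving it in the test set makes the average deviation look better than the algorithm alone would achieve.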