Final Note on the NMRShiftDB Debate- 0.96 ppm

July 12, 2007

by Ryan Sasaki, NMR Product Manager, ACD/Labs

Perhaps some readers will believe that I have spent too many posts on the NMRShiftDB debate. I would like to reiterate that one of my main purposes in creating this blog was to keep the public informed about the world of NMR software. This debate has made it possible to give users of NMR software around the world more information on three different NMR software resources: the NMRShiftDB, ACD/CNMR Predictor, and Modgraph NMRPredict.

So with that I would like to make one final posting on the prediction validation of the NMRShiftDB. The previous presentation of results may lead one to believe that Modgraph has the edge in prediction accuracy.

I have continually asked Modgraph to clear up the overlap issue that we have discussed so heavily here. Again, how much overlap of identical chemical structures is there between the NMRShiftDB and the Modgraph NMRPredict database? Once that overlap is acknowledged and removed, what is the accuracy of this software package on completely novel chemical shifts?

As a co-author of the original validation of prediction accuracy on the NMRShiftDB dataset, I firmly believe that we have created a best practice for prediction validation where we openly shared our structural overlap and generated a prediction error on completely novel chemical shifts.

Our core belief is that information provided to the public should be shared with the best interest of quality science in mind. As a result, when we publish prediction accuracy numbers, we strive to produce numbers that best reflect the performance of the prediction algorithms in a way that is relevant to an end user. In our original study we produced an error of 1.59 ppm for the entire NMRShiftDB. To produce a number that truly reflects the prediction algorithms, we always exclude identical HOSE codes from the database of structures when doing an accuracy assessment. In addition, we clearly acknowledged that there was a significant overlap between our prediction database and the NMRShiftDB (57%). We therefore produced a number for the subset of the NMRShiftDB (43%) that excluded all of this overlap (1.74 ppm). We believe this was a true prediction validation on novel chemical shifts. It is similar to what a user could expect for novel structures and is therefore a relevant number for users to know.
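To make the idea of validating only on novel structures concrete, here is a minimal sketch in Python. It is not the procedure used in the study: the structure identifiers, shift values, and the `training_ids` set are all invented for illustration, "average deviation" is assumed to mean the mean absolute difference between experimental and predicted shifts, and the additional exclusion of identical HOSE codes is not modelled at all.

```python
from statistics import mean

# Hypothetical toy records: (structure_id, experimental_ppm, predicted_ppm).
# Identifiers and shift values are invented for illustration only.
shifts = [
    ("mol-A", 128.5, 128.9),
    ("mol-A", 21.3, 21.1),
    ("mol-B", 170.2, 171.6),
    ("mol-C", 55.0, 57.5),
    ("mol-C", 14.1, 13.0),
]

# Hypothetical set of structures that are also in the predictor's own database.
training_ids = {"mol-A"}


def average_deviation(records):
    """Mean absolute difference between experimental and predicted shifts (ppm)."""
    return mean(abs(exp - pred) for _, exp, pred in records)


def novel_only(records, known_ids):
    """Keep only shifts whose structure is absent from the predictor's database."""
    return [r for r in records if r[0] not in known_ids]


all_ids = {r[0] for r in shifts}
overlap_fraction = len(all_ids & training_ids) / len(all_ids)

overall = average_deviation(shifts)
novel = average_deviation(novel_only(shifts, training_ids))

print(f"overlap: {overlap_fraction:.0%} of structures")
print(f"average deviation, all shifts:   {overall:.2f} ppm")
print(f"average deviation, novel shifts: {novel:.2f} ppm")
```

In this toy data the overlapping structure is predicted well, so removing it raises the reported error. That is the same direction as the difference between the whole-database and novel-subset figures discussed in this post.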

So if Modgraph is unwilling to share the important details of this study regarding overlap with the public, we can only respond by providing the public with a prediction accuracy value that can fairly be compared to theirs.

As a result, we have produced a prediction accuracy result that DOES NOT exclude identical HOSE codes. It is simply the result that we have generated by downloading the SDF file of the NMRShiftDB and running it through version 10 of ACD/CNMR Predictor, the same version that is available to customers today.

The result of this study is an average deviation of 0.96 ppm.

That is an incredible result, no doubt. But understand that it is not the most relevant result in practice. We have simply provided this new number of 0.96 ppm in order to provide our users and followers a number that they can accurately compare to the number of 1.40 ppm provided by Robien using Modgraph’s NMRPredict.

In our original validation document, we freely shared the news that 57% of the structures in NMRShiftDB were also found in the ACD/Labs CNMR prediction database. We believe that the most important result to an end-user would be the performance of the predictor on the remaining 43% of the structures that are not present in ACD/Labs CNMR prediction database, hence the value of 1.74 ppm.

I should also add that the major reason we conduct these validation studies on independent datasets is to challenge our own technology. We are always very eager to test our predictors on new data because it provides us with an unbiased evaluation of the performance of our software. In addition, these validations often help us improve our software. As a matter of fact, during the evaluation of the NMRShiftDB we identified areas within our software in which prediction can be improved. We have already made changes in preparation for version 11 and have achieved significantly better prediction accuracy numbers than the 0.96 ppm I have presented today. I will not, however, share these numbers with the public, since these improvements will not be available until the release of version 11 in the fall.

For now:

An average deviation of 0.96 ppm on the entire NMRShiftDB database (214,136 chemical shifts).

An average deviation of 1.74 ppm on the subset of shifts in the NMRShiftDB database that are not represented in ACD/CNMR Predictor (92,927 chemical shifts).
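As a quick sanity check, the two shift counts above can be compared with the 57% / 43% split quoted earlier. The sketch below uses only the counts given in this post. Note that the percentages were quoted per structure while these counts are per chemical shift, so only approximate agreement should be expected.

```python
# Shift counts quoted in the post.
total_shifts = 214_136   # entire NMRShiftDB
novel_shifts = 92_927    # shifts not represented in ACD/CNMR Predictor

# Shifts that belong to structures also present in the predictor's database.
overlap_shifts = total_shifts - novel_shifts

novel_fraction = novel_shifts / total_shifts
overlap_fraction = overlap_shifts / total_shifts

print(f"overlapping shifts: {overlap_shifts}")
print(f"novel fraction:     {novel_fraction:.1%}")
print(f"overlap fraction:   {overlap_fraction:.1%}")
```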

If the overlap between the two databases is not acknowledged, prediction accuracy results cannot be trusted. For this reason, I will not post any other prediction accuracy results on this blog, either from ACD/Labs or any other software company or research group, unless the overlap with the independent dataset (NMRShiftDB) is clearly acknowledged.

However, in the end, I agree with Jeff Seymour from Modgraph Consultants Ltd. when he says:

“Of course customers are really interested in how accurately a prediction program can predict THEIR molecules – not a collection of external data such as NMRShiftDB.”

Go ahead and evaluate the software packages on your own.

6 Replies to “Final Note on the NMRShiftDB Debate- 0.96 ppm”

x says:

July 13, 2007 at 2:12 am

It’s been interesting following the discussion about the various NMR prediction software packages. But I wish you could include other programs/packages as well, such as:
ChemDraw’s built in 1H/13C estimation
The Perch tools incorporated into TopSpin
… and others which I might not be aware of
In my opinion, a review of all the programs available for NMR prediction would make very interesting reading. Perhaps one is available already?

Egon Willighagen says:

July 13, 2007 at 2:26 am

Funny quote: “Of course customers are really interested in how accurately a prediction program can predict THEIR molecules – not a collection of external data such as NMRShiftDB.”
Sure, they know what THEIR molecules are 🙂
It is interesting to realize that the NMRShiftDB allows you to upload your molecules, or alternatively, you download the software (it’s open source) and the data (it’s open data) if you don’t want to send your molecules over the internet, and the NMRShiftDB software will automatically take into account your own data set.
Thus, if you are working on a series of related molecules, you can extend the NMRShiftDB data set with already elucidated structures, reducing the prediction error for your related, as yet unknown derivatives. It is that easy to include prior/expert knowledge in the NMRShiftDB.
I believe the ACD/Labs software allows this too, so the quote is really meaningless. Not correct, not wrong, simply says nothing.

Ryan says:

July 13, 2007 at 9:22 am

Hi, and thanks for the comments.
Regarding prediction comparisons with other products: I would love to have the opportunity to do that. Unfortunately, it is not always possible for me to get access to specific software packages (especially considering I am employed by a “competitor”), and further, it isn’t necessarily fair for me to generate predictions in another vendor’s software until I am properly trained in all of its options.
One of the reasons I have been so excited about the NMRShiftDB is that it is open access: it allows anyone to download the SDF file and run it through their prediction software for validation. For that reason it should be quite easy for Cambridge Soft and the Perch tools to do the same as both ACD/Labs and Modgraph have done to this point. If they are willing to do that and share those results, I would be happy to share them on my blog, provided the study was conducted correctly.
FYI, ACD/Labs has an older comparison between ACD/HNMR and CNMR Predictor vs. Cambridge Soft’s ChemNMR on their website.
http://www.acdlabs.com/products/spec_lab/predict_nmr/chemnmr/
Beware, however: it is an older comparison, probably from about 3 years ago, and I am sure both ACD/Labs and Cambridge Soft’s predictions have improved significantly since then. Also please note that this was a dataset compiled in-house by ACD/Labs using a leave-one-out analysis.
I think a better comparison would come from Cambridge Soft downloading the SDF file from the NMRShiftDB and generating a prediction error on their own. This would ensure a result from the most recent software packages.
I am of course not issuing a challenge, I just think it would be the best way to accurately compare this package with others.

Ryan says:

July 13, 2007 at 9:49 am

Hi Egon, and thank you for bringing up a very important point, one regarding prediction training. Of course if users add their own data to the prediction engine, it will improve the predictions on THEIR molecules. This feature has been available in ACD/Labs NMR predictors for years, and is also available in Modgraph’s NMRPredict.
However, users can only add already elucidated structures to their databases in order to improve the predictions. I think when Jeff and I refer to “THEIR” molecules, we are referring to the end user’s molecules that have not yet been elucidated (and thus not yet used for training).
This is why I believe the prediction accuracy value on the subset of novel chemical shifts (1.74 ppm) is the most relevant to an end user, because it doesn’t take into account those chemical structures that are already in the database.
So in the end, I don’t believe the quote is meaningless.

Malcolm Beckett says:

July 14, 2007 at 5:26 pm

This is great. Openness and transparency are what I want, not “my Dad is bigger than your Dad” arguments. Seriously, the most important aspect of this discussion is that we learn and are open to new ideas. Thanks to both Jeff and Ryan for the debate, in anticipation that this will hopefully be discussed again in a few years’ time.

Ryan Sasaki says:

July 14, 2007 at 9:40 pm

Hi Malcolm,
Thanks very much for your comments! It is my sincere hope that openness and transparency serve as the foundation for this blog. I am glad you have found the debate both useful and interesting.
