The Purgatory Database

So the NMRShiftDB and our comparison to Modgraph and CSEARCH have caused quite a stir.

Currently we are still awaiting Modgraph’s response on the following two
points:

  1. What is the overlap between NMRShiftDB and Modgraph’s NMR prediction databases? Further, with several different database sources, how much duplication of data exists across the Modgraph databases?
  2. Once that overlap is removed from the dataset, what is the final deviation produced by NMRPredict?

Modgraph’s website currently makes the following two claims:

"Modgraph NMRPredict shows itself to be the most accurate carbon 13 NMR
predictor in an independent evaluation!"

"NMRPredict now contains the world’s largest collection of NMR data –
over 424,632 records in total!" (>345,000 are 13C records)

Let me tackle each statement one by one:

On accuracy:

Well, I think we all know where we stand. Modgraph claims greater accuracy
than us, but questions remain about the overlap and about the validity of their
most recent results. Meanwhile, based on our evaluation of the NMRShiftDB, we
have been tweaking our algorithms for version 11, and I will have some new
results to report soon. Stay tuned.

On "the world’s largest collection of NMR data"
(>345,000 records of 13C NMR data):

If you go to Modgraph’s website here,
you will see how they handle predictions when there is an exact match in the
database. You may also notice on the right pane that there are MANY exact
matching records in the database. In fact, it turns out that there are 25
matching records in the database for Yohimbine within Modgraph NMRPredict
version 3.8.22. However, when you drill down into the database, you will
discover that these aren’t exact matches at all: they are different stereoisomers
of Yohimbine, along with entries from different sources recorded under different experimental conditions.

But it did give us an idea for a quick experiment to run.

I ran a prediction for Benzene in Modgraph NMRPredict Version 3.8.22. The
prediction was very accurate. It turns out that there are 56 exact structure
matches for benzene in their prediction database. 56! And it appears to me that these matches are all logged
as DIFFERENT database records.

Off the top of my head, I ran a few more predictions:

38 exact structure matches for toluene

32 exact matches for acetone

39 exact matches for catechin

First of all, let me stress that it is very important to have different hits
in the database from different sources. For example, entering data from different
solvents helps achieve solvent-specific prediction. But my question is: are these
indeed counted as different records, as it appears? If so, this simply inflates the
number of records in the database, and quoting that number to the public when
talking about NMR prediction can be very misleading. Is the number of
records really a useful statistic?

I think that a user would be more
interested in the number of unique compounds in a database.

In Modgraph’s defense, these records all seem to be different in some way.
They have different references, are in different solvents, temperatures,
concentrations, etc. Fair enough. But it should be known that in our software,
all this information is added to one record:

[Image: benzene record in the ACD/Labs CNMR database]

15 different sources for the CNMR assignments of benzene. Different
solvents, conditions, sources, references, etc. But keep in mind, all this
information is stored in and counted as one record!

We currently have over 186,000 records in our CNMR database, and each of these
represents a unique structure. If we have more than one source
for a chemical structure (and there are many, many instances), it all goes in
the same record.
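
To make the difference between the two counting conventions concrete, here is a minimal sketch in Python. It is not how either product stores its data; the entries, the plain-name structure keys, and the `group_by_structure` helper are all hypothetical and exist only to illustrate the idea of one record per unique structure with every source attached to it.

```python
from collections import defaultdict

# Toy entries, invented for illustration only (not taken from any real database).
# Each tuple is (structure_key, solvent, source_label). A real system would use
# a canonical structure identifier rather than a plain name.
entries = [
    ("benzene", "CDCl3", "ref A"),
    ("benzene", "DMSO-d6", "ref B"),
    ("benzene", "CDCl3", "ref C"),
    ("toluene", "CDCl3", "ref D"),
    ("acetone", "CDCl3", "ref E"),
    ("acetone", "D2O", "ref F"),
]


def group_by_structure(entries):
    """Collect every (solvent, source) pair under its structure key."""
    grouped = defaultdict(list)
    for structure_key, solvent, source in entries:
        grouped[structure_key].append((solvent, source))
    return dict(grouped)


grouped = group_by_structure(entries)

# Convention 1: every measurement counts as its own record.
per_measurement_count = len(entries)

# Convention 2: one record per unique structure, with all sources inside it.
per_structure_count = len(grouped)

print("records counted per measurement:", per_measurement_count)
print("records counted per unique structure:", per_structure_count)
print("sources stored under benzene:", len(grouped["benzene"]))
```

The same underlying data yields two very different headline numbers depending on the convention, which is why the record count alone says little about structural diversity.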

It seems like a good opportunity for me to explain exactly how ACD/Labs
builds prediction databases and why we do not import databases from external
sources.

Our database team culls the most recent literature, draws chemical
structures, and tabulates NMR assignments in what we like to call the purgatory
database. Once entered, these data are screened and prioritized so that
the chemical shifts with the most promise to improve
your predictions are at the top of the list. These records then sit in limbo
until one of our resident NMR spectroscopists is able to quality check them, hence
the name purgatory.

We are often asked whether or not we would be willing to import data from
some other database such as SpecInfo, The Human Metabolome Project, NMRShiftDB,
Aldrich, etc.

Our answer is generally no. This is not because we question the
quality or validity of the data. It is simply because we have a purgatory
database that literally consists of hundreds of thousands of
compounds waiting for quality checking. These data come from the
most recent literature, are not yet in our database, and are
prioritized in the way that most benefits the structural diversity of our
databases.

This is the standard we have put in place and we truly believe it is the
best way for us to ultimately improve the prediction in our NMR software. If we
didn’t hold quality of science at a high standard within our company, we could
double the size of the database tomorrow by importing the entire purgatory
database as-is.

In closing, two conclusions from me:

  1. I believe that we have established a best practice for validating prediction accuracy on an independent data set. Modgraph has apparently chosen not to follow this practice and has quoted very questionable numbers, in my opinion. If they aren’t going to produce those numbers in a consistent way, this is not a comparison, and those numbers are meaningless with respect to ours. Again, we are open to suggestions for a better way or a different method, but none have come. Database issues aside, it’s the algorithms behind the predictions that are absolutely crucial. This is why the issue of overlap is so important: only when ALL overlap is acknowledged and removed does the result represent a fair measurement of algorithm performance.
  2. Modgraph currently has the largest collection of records in the world. But which is more important: the number of records or the number of unique structures? For a database used to generate NMR predictions, the structural diversity of the database is key. While a large database of experimental values will likely improve NMR predictions, it has to be structurally diverse to be useful in the real world.
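
For readers who want to see why overlap matters so much for the first point, here is a rough sketch in Python. It is an illustration under stated assumptions, not the procedure used by ACD/Labs or Modgraph: the structure keys, the shift values, and the helper names `remove_overlap` and `mean_abs_deviation` are all invented, and mean absolute deviation is simply one plausible choice of error measure.

```python
def mean_abs_deviation(pairs):
    """Mean absolute difference between predicted and experimental shifts (ppm)."""
    return sum(abs(predicted - experimental) for predicted, experimental in pairs) / len(pairs)


def remove_overlap(test_set, training_keys):
    """Keep only test structures that are absent from the prediction database."""
    return {key: pairs for key, pairs in test_set.items() if key not in training_keys}


def flatten(test_set):
    """Turn {structure: [(predicted, experimental), ...]} into one flat list."""
    return [pair for pairs in test_set.values() for pair in pairs]


# Placeholder values, invented for illustration only.
# Structure "A" is also in the prediction database, so its "prediction"
# is effectively a lookup and looks perfect.
test_set = {
    "A": [(10.0, 10.0), (20.0, 20.0)],
    "B": [(30.0, 32.0)],
    "C": [(50.0, 46.0)],
}
training_keys = {"A"}

independent_set = remove_overlap(test_set, training_keys)

deviation_with_overlap = mean_abs_deviation(flatten(test_set))
deviation_without_overlap = mean_abs_deviation(flatten(independent_set))

print("deviation with overlap included:", deviation_with_overlap)
print("deviation with overlap removed:", deviation_without_overlap)
```

In this toy case the error doubles once the overlapping structure is taken out, because the looked-up values were flattering the average. Only the second number says anything about the algorithm itself.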

EDIT: This conversation has continued in the following entries (in order):

http://acdlabs.typepad.com/my_weblog/2007/07/final-note-on-t.html