Application Note

Augmentation of ACD/Labs pKa® Prediction Algorithm with Proprietary High-Quality Data

Eduard Kolovanov,1 Alexander Proskura,1 Susanne Winiwarter2

1 Advanced Chemistry Development, Inc. (ACD/Labs), 8 King Street East, Toronto, ON, M5C 1B5, Canada
2 Drug Metabolism and Pharmacokinetics, Research and Early Development Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, Gothenburg, 431 83, Sweden

Introduction

In drug discovery, pKa is a key compound property that governs ionization, influencing molecular interactions and playing a critical role in potency, toxicity, and pharmacokinetics. During the early stages of compound design and discovery, researchers rely on predictive models, such as the ACD/Labs Percepta® software, which applies Hammett equations to estimate pKa values. However, at later stages, or when predictions are found to be less reliable for a specific compound class, pKa measurements are used to ensure accuracy.

To further enhance the precision of the ACD/Labs pKa prediction algorithm, we collaborated with AstraZeneca, leveraging over five years of their experimentally measured pKa data. This collaboration enabled us to refine our software, delivering even more precise and reliable estimations for drug development.

Methods

pKa values measured over a five-year period were extracted from AstraZeneca’s internal database, carefully reviewed for obvious errors, and then sent to ACD/Labs for further processing and algorithm training.

Figure 1: Distribution of ionization formulae in Data Set 1 (>1100 compounds with >2500 pKa values, determined by UV-metric or pH-metric methods).

The data underwent additional scrutiny: contradictory entries were removed, and structural features with potential for improvement were identified. The dataset was then used to refine the Hammett equations, which calculate pKa values from structural fragments, to further enhance prediction accuracy.

pKa_calculated = pKa₀ + ƒ(σ_I, σ_R, σ_meta, σ_para)

New fragments were generated from the dataset, leading to the development of new full Hammett equations or updated pKa₀ values where appropriate. These refinements resulted in noticeable prediction improvements for Data Set 1.
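To make the role of pKa₀ and the σ constants concrete, the sketch below evaluates the classic linear Hammett form, pKa = pKa₀ − ρ·Σσ, for substituted benzoic acids. This is only an illustration: the function ƒ used by the ACD/Labs algorithm is not specified here, the sketch uses meta and para constants only (no σ_I or σ_R terms), and the names `hammett_pka`, `SIGMA`, and the constants are textbook values chosen for the example, not parameters of the software.

```python
# Illustrative Hammett-type estimate: pKa = pKa0 - rho * sum(sigma).
# ASSUMPTION: the linear form and all constants are textbook values for
# benzoic acids in water; they are not taken from the ACD/Labs algorithm.

PKA0_BENZOIC_ACID = 4.20  # pKa of the unsubstituted parent (benzoic acid)
RHO_BENZOIC_ACID = 1.00   # reaction constant, 1.00 by definition for this series

# Hammett substituent constants, keyed by (substituent, ring position).
SIGMA = {
    ("NO2", "para"): 0.78,
    ("NO2", "meta"): 0.71,
    ("Cl", "para"): 0.23,
    ("CH3", "para"): -0.17,
}


def hammett_pka(substituents, pka0=PKA0_BENZOIC_ACID, rho=RHO_BENZOIC_ACID):
    """Estimate pKa from the parent value and the summed sigma constants."""
    total_sigma = sum(SIGMA[s] for s in substituents)
    return pka0 - rho * total_sigma


# 4-nitrobenzoic acid: an electron-withdrawing group lowers the pKa.
print(round(hammett_pka([("NO2", "para")]), 2))  # prints 3.42
```

In this picture, updating a pKa₀ value shifts every prediction for a fragment by a constant, whereas defining a new full equation also changes how strongly substituents move the value.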

To further evaluate model performance, an external test set (Data Set 2) was used, comprising over 1,000 compounds with more than 2,200 experimentally determined pKa values, measured using UV-metric or pH-metric methods. Unlike Data Set 1, this dataset underwent minimal refinement, and no adjustments were made for tautomeric forms or pKa values.

The similarity between compounds in the training and test sets was also not assessed. However, compounds with multiple inconsistent measurements were excluded to maintain data integrity.
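One way such an exclusion step could look is sketched below. The record layout `(compound_id, center_id, pka)`, the function name `drop_inconsistent`, and the 0.5 log unit tolerance are all assumptions made for illustration; the criterion actually applied to Data Set 2 is not stated.

```python
from collections import defaultdict


def drop_inconsistent(measurements, tolerance=0.5):
    """Drop every compound whose replicate pKa values disagree.

    measurements: list of (compound_id, center_id, pka) tuples (hypothetical layout).
    tolerance: largest accepted spread, in log units, between replicates
               of the same ionization center (hypothetical threshold).
    """
    by_center = defaultdict(list)
    for compound_id, center_id, pka in measurements:
        by_center[(compound_id, center_id)].append(pka)

    # A compound is flagged if any of its centers has replicates too far apart.
    inconsistent = {
        compound_id
        for (compound_id, _), values in by_center.items()
        if max(values) - min(values) > tolerance
    }
    return [m for m in measurements if m[0] not in inconsistent]


data = [
    ("cmpd-1", "B1", 8.1), ("cmpd-1", "B1", 8.2),  # replicates agree
    ("cmpd-2", "A1", 4.0), ("cmpd-2", "A1", 6.5),  # replicates disagree
    ("cmpd-3", "A1", 3.3),                         # single measurement
]
print(drop_inconsistent(data))
```

Removing the whole compound, rather than a single value, avoids having to decide which of the conflicting measurements is correct.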

Figure 2: Distribution of ionization formulae in Data Set 2.

Results

Data Cleaning Example

An essential step in data refinement was selecting the most appropriate tautomeric form to ensure accurate pKa predictions. Scheme 1 illustrates two examples where the tautomeric form on the left was not considered ionizable.

Scheme 1: Examples of tautomeric forms considered.

Additionally, contradictory data were identified and examined. Scheme 2 highlights examples of similar compounds with two amine groups within the same molecule. The pKa of the amine adjacent to the aryl group is influenced by its proximity to the primary amine, with a shorter distance typically resulting in a lower pKa value. However, this expected pattern was not always observed in experimental data, indicating a possible discrepancy.

Scheme 2: Examples of similar molecules, each containing two amine groups.

Software Improvement

New pKa₀ values were specified for over 400 fragments, while for more than 70 fragments new or updated Hammett equations were defined. Additionally, two new dissociating centers were identified.

Scheme 3: The two new dissociating centers that were identified.

After training, 98.7% of pKa values were predicted within 1 log unit and 84.1% within 0.5 log units. Before training, only 72.3% were predicted within 1 log unit, and numerous values deviated from experiment by more than 2 log units. Thirteen ionization centers were not predicted at all by the old algorithm. Table 1 details the training results, showing the share of data points predicted within the specified limits, based on pairwise comparisons of the closest values.

Table 1: Training results: share of data points predicted within each error limit (log units); N is the total number of data points.

                  N<0.5   N<1.0   N<2.0   N<3.0   N<8.0   N
Before Training   47.6%   72.3%   94.0%   98.4%   100%    2249
After Training    84.1%   98.7%   100%                    2262
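Statistics of this kind can be computed along the following lines. This is a minimal sketch: the greedy matching in `pair_closest` is only one plausible reading of "pairwise comparisons of the closest values", and both function names are hypothetical rather than part of the software.

```python
from itertools import product


def pair_closest(experimental, predicted):
    """Greedily pair the pKa values of one compound by smallest absolute error.

    ASSUMPTION: each value is used at most once, and values left without
    a partner are dropped. Returns the absolute errors of the matched pairs.
    """
    candidates = sorted(
        (abs(e - p), i, j)
        for (i, e), (j, p) in product(enumerate(experimental), enumerate(predicted))
    )
    used_exp, used_pred, errors = set(), set(), []
    for error, i, j in candidates:
        if i not in used_exp and j not in used_pred:
            used_exp.add(i)
            used_pred.add(j)
            errors.append(error)
    return errors


def fraction_within(errors, limits=(0.5, 1.0, 2.0, 3.0, 8.0)):
    """Fraction of absolute errors strictly below each limit (in log units)."""
    if not errors:
        raise ValueError("no matched data points")
    return {limit: sum(e < limit for e in errors) / len(errors) for limit in limits}


# Two made-up compounds as (experimental, predicted) lists of pKa values.
compounds = [([9.1, 4.2], [4.0, 9.9]), ([7.5], [6.2])]
all_errors = [err for exp, pred in compounds for err in pair_closest(exp, pred)]
print(fraction_within(all_errors))
```

Under this matching, an ionization center with no predicted counterpart contributes no data point, so the number of matched pairs can differ between two versions of an algorithm.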

 

Calculation statistics for Data Set 1 are also presented in Figures 3a and 3b.

Figure 3a/b: Visual comparison of predicted vs. experimental pKa values before (a) and after (b) training for Data Set 1.

Predictions for Data Set 2

To assess the effectiveness of these improvements, predictions from the old and new pKa prediction algorithms were compared using Data Set 2, which included both acidic and basic pKa values. The graphs below show a comparison of the most acidic (A1) and most basic (B1) values specifically. The blue regression line and purple unity line illustrate the relationship between predicted and observed values, demonstrating the improved predictive performance achieved through these refinements.
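For readers who want to reproduce this kind of comparison on their own data, the sketch below fits an ordinary least-squares line to predicted versus experimental values. A perfect model would coincide with the unity line (slope 1, intercept 0). The function name `least_squares_line`, the choice of experimental values as the x-axis, and the numbers are assumptions for illustration only, not data from this study.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares fit y = slope*x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up values: x = experimental pKa, y = predicted pKa (assumed axis convention).
experimental = [3.1, 4.8, 7.2, 8.9, 10.1]
predicted = [3.4, 4.6, 7.0, 9.3, 9.8]

slope, intercept = least_squares_line(experimental, predicted)
# The closer slope is to 1 and intercept to 0, the closer the fit lies to the unity line.
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```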

Additionally, Figure 4d highlights the significant improvement in prediction accuracy after training for compounds containing a piperazine next to an aromatic ring.

Figure 4a/b: Visual comparison of predicted vs. experimental pKa values before (a) and after (b) training for the most acidic values.

Figure 4c/d: Visual comparison of predicted vs. experimental pKa values before (c) and after (d) training for the most basic values.

Conclusions

The integration of AstraZeneca’s high-quality experimental data into the ACD/Labs pKa prediction algorithm led to the identification of two new dissociating centers. The updated prediction model demonstrated greater accuracy in training data, and further evaluation with Data Set 2 indicated improved predictions, particularly for basic compounds containing piperazines adjacent to an aromatic ring.

A quick inspection revealed at least two significant outliers in Figure 4. The first was an acidic compound whose most acidic pKa lay outside the experimental range, so that pKa A1 was assigned an artificially high value (Figure 4a/b). The second was a basic compound whose experimental pKa values for B1 and B2 were inadvertently swapped (Figure 4c/d), so that the experimental B1 value shown in the plot is far too low. Similar discrepancies may affect the statistical significance of the observed improvements; however, a more detailed analysis of the experimental data was beyond the scope of this investigation.

Analysis of the results of Data Set 2 shows that for a complex property like pKa it is necessary to continuously expand the training dataset. This is due to the ongoing emergence of new classes of compounds that are not adequately represented by the current predictive algorithm. However, for classes of compounds similar to those already present in the training dataset, significant improvements in prediction accuracy are observed.
