Application Note

Augmentation of ACD/Labs pK_a® Prediction Algorithm with Proprietary High-Quality Data

Eduard Kolovanov,¹ Alexander Proskura,¹ Susanne Winiwarter²

¹Advanced Chemistry Development, Inc. (ACD/Labs), 8 King Street East, Toronto, ON, M5C 1B5, Canada
²Drug Metabolism and Pharmacokinetics, Research and Early Development Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, Gothenburg, 431 83, Sweden

Introduction

In drug discovery, pK_a is a key compound property that governs ionization, influencing molecular interactions and playing a critical role in potency, toxicity, and pharmacokinetics. During the early stages of compound design and discovery, researchers rely on predictive models, such as the ACD/Labs Percepta® software, which applies Hammett equations to estimate pK_a values. However, at later stages, or when predictions are found to be less reliable for a specific compound class, pK_a measurements are used to ensure accuracy.

To further enhance the precision of the ACD/Labs pK_a prediction algorithm, we collaborated with AstraZeneca, leveraging over five years of their experimentally measured pK_a data. This collaboration enabled us to refine our software, delivering even more precise and reliable estimations for drug development.

Methods

pK_a values measured over a five-year period were extracted from AstraZeneca’s internal database, carefully reviewed for obvious errors, and then sent to ACD/Labs for further processing and algorithm training.

Figure 1: Distribution of ionization formulae in Data Set 1 (>1100 compounds with >2500 pK_a values, determined by UV-metric or pH-metric methods).

The data then underwent additional scrutiny: contradictory entries were removed, and structural features where predictions could be improved were identified. The cleaned dataset was used to refine the Hammett equations, which calculate pK_a values from structural fragments:

pK_a(calculated) = pK_a₀ + ƒ(σ_I, σ_R, σ_meta, σ_para)
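As a rough illustration of the fragment-based form above, the sketch below uses the textbook linear Hammett relation, pK_a = pK_a₀ − ρ·Σσ, for substituted benzoic acids. The functional form of ƒ and the parameters used in the ACD/Labs algorithm are not given in this note; the function name, the parent pK_a₀ of 4.20, ρ = 1.00, and the σ constants are commonly tabulated textbook values chosen purely for illustration.

```python
# Illustrative only: classic linear Hammett relation for substituted benzoic acids.
# The actual f(sigma_I, sigma_R, sigma_meta, sigma_para) used by the software is not
# described in this note; the constants below are commonly tabulated textbook values.

# Hammett substituent constants: (ring position, substituent) -> sigma
SIGMA = {
    ("meta", "NO2"): 0.71,
    ("para", "NO2"): 0.78,
    ("meta", "Cl"): 0.37,
    ("para", "Cl"): 0.23,
    ("para", "OCH3"): -0.27,
    ("para", "CH3"): -0.17,
}

BENZOIC_ACID_PKA0 = 4.20  # pKa of the unsubstituted parent acid in water
BENZOIC_ACID_RHO = 1.00   # reaction constant; 1 by definition for this reference series


def hammett_pka(pka0, rho, substituents):
    """Return pka0 - rho * sum(sigma) for the listed (position, substituent) pairs."""
    sigma_sum = sum(SIGMA[s] for s in substituents)
    return pka0 - rho * sigma_sum


pka_p_nitro = hammett_pka(BENZOIC_ACID_PKA0, BENZOIC_ACID_RHO, [("para", "NO2")])
pka_p_methoxy = hammett_pka(BENZOIC_ACID_PKA0, BENZOIC_ACID_RHO, [("para", "OCH3")])

print(f"4-nitrobenzoic acid:   {pka_p_nitro:.2f}")
print(f"4-methoxybenzoic acid: {pka_p_methoxy:.2f}")
```

In this simplified picture, an electron-withdrawing group (positive σ) lowers the estimated pK_a of the acid and an electron-donating group (negative σ) raises it. A fragment-specific pK_a₀ and equation take the place of the two constants used here.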

New fragments were generated from the dataset, leading to the development of new full Hammett equations or updated pK_a₀ values where appropriate. These refinements resulted in noticeable prediction improvements for Data Set 1.

To further evaluate model performance, an external test set (Data Set 2) was used, comprising over 1,000 compounds with more than 2,200 experimentally determined pK_a values, measured using UV-metric or pH-metric methods. Unlike Data Set 1, this dataset underwent minimal refinement, and no adjustments were made for tautomeric forms or pK_a values.

The similarity between compounds in the training and test sets was not assessed either. However, compounds with multiple inconsistent measurements were excluded to maintain data integrity.
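The exclusion of compounds with inconsistent replicate measurements could be sketched as follows. The record layout, the function name, and the 0.5 log-unit tolerance are assumptions made for illustration; the note does not state the criterion that was actually applied.

```python
from collections import defaultdict


def drop_inconsistent(records, tolerance=0.5):
    """Keep compounds whose replicate pKa values agree within `tolerance` log units.

    records: list of (compound_id, center_id, pka) tuples (hypothetical layout).
    A compound is dropped entirely if any of its ionization centers has
    replicate values spread over more than `tolerance` (assumed criterion).
    """
    values = defaultdict(list)
    for compound, center, pka in records:
        values[(compound, center)].append(pka)

    rejected = {
        compound
        for (compound, _center), pkas in values.items()
        if max(pkas) - min(pkas) > tolerance
    }
    return [record for record in records if record[0] not in rejected]


# Made-up measurements, not data from the study
measurements = [
    ("cpd-1", "B1", 8.9), ("cpd-1", "B1", 9.1),  # replicates agree
    ("cpd-2", "A1", 4.2), ("cpd-2", "A1", 6.0),  # replicates disagree by 1.8
    ("cpd-3", "B1", 7.4),                        # single measurement
]
kept = drop_inconsistent(measurements)
print(sorted({compound for compound, _, _ in kept}))
```

With these made-up values, only the compound with disagreeing replicates is removed; single measurements pass through unchanged because there is nothing to compare them against.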

Figure 2: Distribution of ionization formulae in Data Set 2.

Results

Data Cleaning Example

An essential step in data refinement was selecting the most appropriate tautomeric form to ensure accurate pK_a predictions. Scheme 1 illustrates two examples where the tautomeric form on the left was not considered ionizable.

Scheme 1: Examples of tautomeric forms considered.

Additionally, contradictory data were identified and examined. Scheme 2 highlights examples of similar compounds with two amine groups within the same molecule. The pK_a of the amine adjacent to the aryl group is influenced by its proximity to the primary amine, with a shorter distance typically resulting in a lower pK_a value. However, this expected pattern was not always observed in experimental data, indicating a possible discrepancy.

Scheme 2: Examples of similar molecules, each containing two amine groups.

Software Improvement

New pK_a₀ values were specified for over 400 fragments, and new or updated Hammett equations were defined for more than 70 fragments. Additionally, two new dissociating centers were identified.

Scheme 3: The two newly identified dissociating centers.

After training, the pK_a for 98.7% of compounds was predicted within 1 log unit, and for 84.1% within 0.5 log units. Before training, only 72.3% were predicted within 1 log unit, and numerous values were off by more than 2 log units. Thirteen ionization centers were not predicted at all by the old algorithm, which is consistent with the difference in N between the two rows of Table 1 (2249 versus 2262). Table 1 details the training results as the percentage of data points predicted within each limit, based on pairwise comparison of the closest experimental and predicted values.

Table 1: Percentage of data points in Data Set 1 predicted within each error limit (log units) before and after training; N is the number of data points compared.

	N<0.5	N<1.0	N<2.0	N<3.0	N<8.0	N
Before Training	47.6%	72.3%	94.0%	98.4%	100%	2249
After Training	84.1%	98.7%	100%	–	–	2262
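The kind of statistics reported in Table 1 can be reproduced in outline with the sketch below. The note does not specify how experimental and predicted values were paired beyond "closest values", so the greedy closest-match pairing, the function names, and the example numbers are all assumptions for illustration.

```python
def pair_closest(experimental, predicted):
    """Greedily pair each experimental pKa with the closest unused predicted pKa.

    Returns the absolute errors of the matched pairs. Experimental values left
    without a partner (no prediction available) contribute no error.
    """
    candidates = sorted(
        (abs(e - p), i, j)
        for i, e in enumerate(experimental)
        for j, p in enumerate(predicted)
    )
    used_exp, used_pred, errors = set(), set(), []
    for err, i, j in candidates:
        if i not in used_exp and j not in used_pred:
            used_exp.add(i)
            used_pred.add(j)
            errors.append(err)
    return errors


def percent_within(errors, limits=(0.5, 1.0, 2.0)):
    """Percentage of errors strictly below each limit (log units)."""
    n = len(errors)
    return {limit: 100.0 * sum(e < limit for e in errors) / n for limit in limits}


# Made-up values for three compounds: (experimental pKa list, predicted pKa list)
compounds = [
    ([3.9, 9.2], [9.5, 4.0]),
    ([7.8], [6.5]),
    ([2.5, 10.1], [10.0]),  # one center has no prediction
]
all_errors = [err for exp, pred in compounds for err in pair_closest(exp, pred)]
stats = percent_within(all_errors)
print(len(all_errors), stats)
```

Under this pairing, a center that the algorithm does not predict at all drops out of N rather than counting as a large error, which mirrors the smaller N reported before training.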

Calculation statistics for Data Set 1 are also presented in Figures 3a and 3b.

Figure 3a/b: Visual comparison of predicted vs. experimental pK_a values before (a) and after (b) training for Data Set 1.

Predictions for Data Set 2

To assess the effectiveness of these improvements, predictions from the old and new pK_a prediction algorithms were compared using Data Set 2, which included both acidic and basic pK_a values. The graphs below show a comparison of the most acidic (A1) and most basic (B1) values specifically. The blue regression line and purple unity line illustrate the relationship between predicted and observed values, demonstrating the improved predictive performance achieved through these refinements.

Additionally, Figure 4d highlights the significant improvement in prediction accuracy after training for compounds containing a piperazine next to an aromatic ring.

Figure 4a/b: Visual comparison of predicted vs. experimental pK_a values before (a) and after (b) training for the most acidic values.

Figure 4c/d: Visual comparison of predicted vs. experimental pK_a values before (c) and after (d) training for the most basic values.
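The A1/B1 comparison described above can be sketched as follows, taking the most acidic value as the lowest acidic pK_a and the most basic value as the highest basic pK_a of each compound. The data layout, the function names, the choice to regress predicted on experimental values, and the numbers are assumptions for illustration and are not taken from Data Set 2.

```python
def most_acidic(acidic_pkas):
    """A1: the lowest pKa among a compound's acidic centers, or None if it has none."""
    return min(acidic_pkas) if acidic_pkas else None


def most_basic(basic_pkas):
    """B1: the highest pKa among a compound's basic centers, or None if it has none."""
    return max(basic_pkas) if basic_pkas else None


def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up basic pKa values for four compounds
compounds = [
    {"exp_basic": [9.0, 4.1], "pred_basic": [8.5, 3.0]},
    {"exp_basic": [8.0], "pred_basic": [8.0]},
    {"exp_basic": [7.0, 2.0], "pred_basic": [7.5, 2.5]},
    {"exp_basic": [6.0], "pred_basic": [6.0]},
]
exp_b1 = [most_basic(c["exp_basic"]) for c in compounds]
pred_b1 = [most_basic(c["pred_basic"]) for c in compounds]

slope, intercept = fit_line(exp_b1, pred_b1)
print(f"regression line: predicted = {slope:.2f} * experimental + {intercept:.2f}")
print("unity line:      predicted = 1.00 * experimental + 0.00")
```

The closer the fitted slope is to 1 and the intercept to 0, the closer the regression line lies to the unity line, which is how the plots convey agreement between predicted and observed values.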

Conclusions

The integration of AstraZeneca’s high-quality experimental data into the ACD/Labs pK_a prediction algorithm led to the identification of two new dissociating centers. The updated prediction model was more accurate on the training data, and further evaluation with Data Set 2 indicated improved predictions, particularly for basic compounds containing piperazines adjacent to an aromatic ring.

A quick inspection revealed at least two significant outliers in Figure 4. The first involved an acidic compound whose most acidic pK_a was outside the experimental range, causing pK_a A1 to be assigned an artificially high value (Figures 4a/b). The second involved a basic compound for which the experimental pK_a values for B1 and B2 were inadvertently swapped (Figures 4c/d), so the experimental B1 value shown in the plot was far too low. Similar discrepancies may affect the statistical significance of the observed improvements; however, a more detailed analysis of the experimental data was beyond the scope of this investigation.

Analysis of the Data Set 2 results shows that, for a complex property like pK_a, the training dataset must be expanded continuously, because new classes of compounds keep emerging that the current predictive algorithm does not adequately represent. For classes of compounds similar to those already present in the training dataset, however, significant improvements in prediction accuracy are observed.
