pKALM: Accurate and Rapid Protein pKa Prediction from Sequence

pKALM[1] predicts protein pKa values directly from amino-acid sequence using a protein language model and transfer learning; no structure is required. It reaches state-of-the-art accuracy (RMSE of 0.8658 pKa units) at a throughput of about 4,965 pKa predictions per second, and covers six ionizable side chains (Asp, Glu, His, Lys, Cys, Tyr) plus the N- and C-termini.

Run a Prediction

Paste one or more sequences in FASTA format or upload a FASTA file, pick a model, and submit. Each run opens a dedicated, bookmarkable result page.

Up to 2,000 residues per job. For longer sequences, please contact us. Uploaded files may be .fasta, .fa, .txt, or .seq.
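
The input limits above can be checked locally before submitting. The following is a minimal sketch, not the server's own validation: the helper names `parse_fasta` and `check_job` are hypothetical, and it assumes the 2,000-residue limit applies to the total across all sequences in a job, which the page does not state explicitly.

```python
from typing import Dict

MAX_RESIDUES = 2000  # limit quoted on the page; assumed to cover the whole job
ALLOWED_EXTENSIONS = (".fasta", ".fa", ".txt", ".seq")


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip() or f"seq{len(records) + 1}"
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header] += line.upper()
    return records


def check_job(filename: str, text: str) -> Dict[str, str]:
    """Validate file extension and total residue count, then return the records."""
    if not filename.lower().endswith(ALLOWED_EXTENSIONS):
        raise ValueError(f"unsupported file type: {filename}")
    records = parse_fasta(text)
    if not records:
        raise ValueError("no sequences found")
    total = sum(len(seq) for seq in records.values())
    if total > MAX_RESIDUES:
        raise ValueError(f"{total} residues exceeds the {MAX_RESIDUES}-residue limit")
    return records


if __name__ == "__main__":
    demo = ">example\nMKDEHC\nYA\n"
    print(check_job("demo.fasta", demo))
```

If the limit turns out to be per sequence rather than per job, the check on `total` would move inside a loop over `records.values()`.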

Predictions currently run on a CPU server. pKALM is sequence-based and fast, so most jobs finish in seconds to a minute. The result page shows live progress, can be bookmarked or shared, and is retained for one week.

How pKALM Works

Architecture of pKALM: a protein language model, protein and peptide pI models, and a residue embedding feed a BiLSTM whose output is added to standard pKa values; a separate pI model predicts isoelectric point.
Figure 1: Overview of pKALM. (a) The pKa predictor concatenates features from a frozen protein language model (ESM2)[2], protein and peptide pI models, and a residue embedding, feeds them to a BiLSTM, and adds the learned shift to the standard pKa. (b) The shared pI model architecture (residue embedding → BiLSTM → average → linear).
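
The final step in panel (a), adding a learned shift to a standard pKa, can be illustrated without the network itself. The sketch below is an assumption-laden illustration: the reference values in `STANDARD_PKA` are commonly used approximate model pKa values and are not taken from pKALM, the function `apply_shifts` is hypothetical, and the shifts are made-up numbers standing in for the BiLSTM output.

```python
from typing import Dict, List, Tuple

# Illustrative reference pKa values (assumption: NOT the values pKALM uses).
STANDARD_PKA: Dict[str, float] = {
    "Asp": 3.8,
    "Glu": 4.5,
    "His": 6.5,
    "Cys": 9.0,
    "Tyr": 10.0,
    "Lys": 10.5,
    "N-terminus": 8.0,
    "C-terminus": 3.2,
}


def apply_shifts(
    sites: List[Tuple[str, int]], shifts: List[float]
) -> List[Tuple[str, int, float]]:
    """Return (group, position, pKa), where pKa = standard pKa + learned shift.

    `shifts` stands in for the per-site output of the trained model.
    """
    if len(sites) != len(shifts):
        raise ValueError("need exactly one shift per ionizable site")
    results = []
    for (group, position), shift in zip(sites, shifts):
        if group not in STANDARD_PKA:
            raise KeyError(f"unsupported ionizable group: {group}")
        results.append((group, position, round(STANDARD_PKA[group] + shift, 2)))
    return results


if __name__ == "__main__":
    # Hypothetical sites and shifts, for illustration only.
    sites = [("Asp", 3), ("His", 5), ("Lys", 9)]
    shifts = [-0.4, 1.0, 0.2]
    for group, position, pka in apply_shifts(sites, shifts):
        print(f"{group}{position}: {pka:.2f}")
```

Predicting a shift rather than an absolute value means a site with no informative context defaults to its standard pKa when the shift is near zero.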

Performance

pKALM is benchmarked against widely used pKa predictors — the empirical method PROPKA[3], the Poisson–Boltzmann solver PypKa[4], and the deep-learning methods pKAI / pKAI+[5] and DeepKa[6].

Bar charts comparing pKALM RMSE against pKAI+, DeepKa, pKAI, PypKa and PROPKA for abundant residues (Asp, Glu, His, Lys) and less abundant residues (Cys, Tyr, N- and C-terminus).
Figure 2: pKa RMSE vs. existing methods. (a) Abundant residues — pKALM gives the lowest RMSE on Asp, His and Lys. (b) Less abundant residues — pKALM uniquely predicts Cys and leads on Tyr and the termini, where several structure-based methods are unavailable (marked "X").
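
For readers who want to reproduce a per-residue comparison of this kind on their own data, RMSE is the square root of the mean squared difference between predicted and experimental pKa values. The sketch below is a generic implementation, not the benchmark code behind Figure 2; the function names and the example rows are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def rmse(predicted: Sequence[float], experimental: Sequence[float]) -> float:
    """Root-mean-square error between two equal-length lists of pKa values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_by_residue(rows: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """Group (residue_type, predicted, experimental) rows and compute RMSE per type."""
    grouped: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for residue_type, predicted, experimental in rows:
        grouped[residue_type].append((predicted, experimental))
    return {
        residue_type: rmse([p for p, _ in pairs], [e for _, e in pairs])
        for residue_type, pairs in grouped.items()
    }


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    rows = [
        ("Asp", 3.5, 3.0),
        ("Asp", 4.0, 4.2),
        ("His", 6.8, 6.5),
    ]
    for residue_type, value in rmse_by_residue(rows).items():
        print(f"{residue_type}: RMSE = {value:.3f}")
```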

Case Studies

Beyond aggregate metrics, pKALM recovers chemically meaningful sites directly from sequence — including catalytic residues whose pKa is strongly perturbed by their environment.

Three protein structures with predicted ionizable residues highlighted: ribonuclease H, the chymotrypsin catalytic triad, and the bacterial peroxiredoxin AhpC with residues having high attention to the catalytic Cys46.
Figure 3: Exemplary predictions. (a) The five titrated residues of ribonuclease H. (b) The catalytic triad of chymotrypsin — the shifted His57 pKa (predicted 7.65 vs. experimental 7.5) is captured from sequence alone. (c) The bacterial peroxiredoxin AhpC: the catalytic Cys46 (red) and residues with high attention to it (orange) are distant in sequence but close in 3D, suggesting the PLM encodes long-range structural context.

Supported Ionizable Groups

Side chains
Asp, Glu, His, Lys, Cys, Tyr
Termini
N-terminus, C-terminus

Each job also returns isoelectric points (pI) predicted by the peptide and protein pI models, appended to the results.
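
To make the supported groups concrete, the sketch below lists the sites in a sequence that would receive a pKa prediction: every Asp, Glu, His, Lys, Cys and Tyr side chain, plus the two termini. It is an illustration only; the function `ionizable_sites` is hypothetical, and the 1-based numbering is an assumption about how positions are reported.

```python
from typing import List, Tuple

# One-letter codes of the six supported ionizable side chains.
ONE_TO_THREE = {
    "D": "Asp",
    "E": "Glu",
    "H": "His",
    "K": "Lys",
    "C": "Cys",
    "Y": "Tyr",
}


def ionizable_sites(sequence: str) -> List[Tuple[str, int]]:
    """Return (group, 1-based position) for each supported ionizable group."""
    seq = sequence.strip().upper()
    if not seq:
        return []
    sites: List[Tuple[str, int]] = [("N-terminus", 1)]
    for position, residue in enumerate(seq, start=1):
        if residue in ONE_TO_THREE:
            sites.append((ONE_TO_THREE[residue], position))
    sites.append(("C-terminus", len(seq)))
    return sites


if __name__ == "__main__":
    for group, position in ionizable_sites("MKDEHCYA"):
        print(group, position)
```

Residues outside this set, such as Arg, are skipped because they are not among the groups listed above.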

References

Please contact shijie.xu@ees.hokudai.ac.jp for any questions.

Changelogs