PsiPartition: Improved Site Partitioning for Genomic Data by Parameterized Sorting Indices and Bayesian Optimization

PsiPartition[1] is a tree-independent site-partitioning method for phylogenetic analysis. It scores every site with a Parameterized Sorting Index (PSI) and uses Bayesian optimization to automatically determine the best partitioning scheme, without requiring a reference tree or prior knowledge of the data. This yields more accurate reconstructions, especially for large genomic datasets with strong site heterogeneity.

Introduction

DNA (deoxyribonucleic acid) is the blueprint of life. It carries the instructions for making proteins, which are essential to all living things. These instructions are written in three-letter "words" called codons (Figure 1). Almost all life forms use the same genetic code, which shows how all living things are connected. The code is not one-to-one, however: several codons can encode the same amino acid. This redundancy is called "degeneracy" and is thought to be a result of evolution. Codons for the same amino acid usually differ at the third position, so most changes there leave the amino acid unchanged. The third position is therefore often treated as the "synonymous" site, and the first and second positions as "non-synonymous" sites. As a result, the third position matters less for protein structure and function and is under weaker selective pressure.

The standard genetic code table
Figure 1: The standard genetic code. Each three-letter codon specifies one amino acid (shown with its three- and one-letter codes), coloured by side-chain property. Note the redundancy — many amino acids are encoded by several codons that differ only at the third position.
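
The redundancy in Figure 1 can be checked numerically. The sketch below builds the standard genetic code from its usual compact form and counts, for each codon position, how many single-base changes leave the encoded amino acid unchanged. The function name `synonymous_fraction` is only illustrative, and stop codons are treated as one extra "amino acid" for simplicity.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ... ('*' = stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that do not change the encoded amino acid."""
    same = total = 0
    for codon, aa in CODON_TABLE.items():
        for base in BASES:
            if base == codon[position]:
                continue  # not a change
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODON_TABLE[mutant] == aa
    return same / total


fractions = [synonymous_fraction(p) for p in range(3)]
for p, fraction in enumerate(fractions, start=1):
    print(f"codon position {p}: {fraction:.1%} of single-base changes are synonymous")
```

The third position comes out far more tolerant of change than the first two, which is the pattern that codon-based partitioning relies on.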

The difference in selective pressure between synonymous and non-synonymous sites is taken into account by partitioned models in phylogenetic inference. These models assume that different sites in the sequence alignment evolve at different rates, and they use different substitution matrices for different groups of sites. Partitioned models can improve the accuracy of phylogenetic inference, especially for large genomic datasets with strong site heterogeneity. However, they require the user to specify the number of partitions and the sites in each partition. This is challenging, because it requires prior knowledge of the data.

Partitioned models
Figure 2: Partitioned models in phylogenetic inference. The models assume that different sites in the sequence alignment have different evolutionary rates, and use different substitution matrices for different sites.
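
To make "specifying the partitions" concrete, here is a minimal sketch of the classic manual choice for protein-coding DNA: one partition per codon position. It writes the definitions in the RAxML-style layout (`type, name = start-end\step`) that IQ-TREE can read. The helper name `codon_position_partitions` and the partition names are hypothetical, and this is not necessarily the layout of the *.parts file that PsiPartition writes.

```python
def codon_position_partitions(n_sites, data_type="DNA"):
    """Return RAxML-style partition lines that split an in-frame coding
    alignment of n_sites columns by codon position (1st, 2nd, 3rd)."""
    if n_sites < 3:
        raise ValueError("need at least one full codon")
    return [
        # "1-900\3" means sites 1, 4, 7, ... up to 900
        f"{data_type}, codon_pos{start} = {start}-{n_sites}\\3"
        for start in (1, 2, 3)
    ]


lines = codon_position_partitions(900)
print("\n".join(lines))
```

This only works when the reading frame is known and the alignment is purely coding, which is exactly the kind of prior knowledge an automatic method tries to avoid.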

How PsiPartition Works

The hard part of a partitioned model is deciding how many partitions to use and which sites go into each one. The number of possible partitionings of an alignment grows faster than exponentially with the number of sites (the Bell number), so an exhaustive search is impossible. Existing automatic methods either depend on a reconstructed reference tree — which is slow to obtain and may itself be wrong — or rely on greedy search that can get trapped in local optima.
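
To get a feeling for how quickly the search space grows, the following sketch computes Bell numbers with the Bell triangle. The function name `bell_numbers` is only illustrative.

```python
def bell_numbers(n_max):
    """Return [B_0, B_1, ..., B_n_max], where B_n is the number of ways
    to partition a set of n items, computed with the Bell triangle."""
    row = [1]
    bells = [1]  # B_0
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left of it.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


bells = bell_numbers(100)
for n in (5, 10, 20, 50, 100):
    print(f"{n:>3} sites: {len(str(bells[n]))}-digit number of possible partitionings")
```

Ten sites already allow 115,975 partitionings, and real alignments have thousands of sites, so any practical method has to restrict the search in some way.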

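PsiPartition avoids both problems with two ideas:

  1. A Parameterized Sorting Index (PSI). Every site receives a PSI score, so no reference tree is needed.
  2. Bayesian optimization. The partitioning scheme, including the number of partitions, is chosen automatically by Bayesian optimization instead of a greedy search.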

Sites are then sorted by their PSI and binned into the chosen number of partitions, so slowly-evolving conserved sites and fast-evolving variable sites end up in different partitions.

Distribution of sites across PSI bins
Figure 3: The number of sites in each PSI bin produced by PsiPartition on eight empirical DNA datasets. Invariant (blue) and variant (red) sites are separated across bins instead of being lumped together, so the partitioning reflects real differences in evolutionary rate. Adapted from the paper.
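
The sort-then-bin step can be sketched in a few lines. Please note the assumptions: the PSI formula is not given here, so the code uses the Shannon entropy of each alignment column as a stand-in score, and it cuts the sorted sites into bins of nearly equal size, which may differ from how PsiPartition places its bin boundaries. The names `column_entropy` and `bin_sites_by_score` are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column.
    Stand-in variability score only; this is NOT the PSI."""
    counts = Counter(column)
    total = len(column)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def bin_sites_by_score(scores, n_partitions):
    """Sort site indices by score and cut them into n_partitions bins
    of (nearly) equal size. Equal-size bins are an assumption."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = []
    for k in range(n_partitions):
        start = k * len(order) // n_partitions
        stop = (k + 1) * len(order) // n_partitions
        bins.append(sorted(order[start:stop]))
    return bins


# Toy alignment: 4 taxa, 6 sites (0-based site indices below)
alignment = ["ACGTAC", "ACGAAT", "ACCTAG", "ACGCAA"]
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
scores = [column_entropy(col) for col in columns]
partitions = bin_sites_by_score(scores, n_partitions=2)
print(partitions)
```

In this toy case the three invariant columns land in the first bin and the three variable columns in the second, which mirrors the separation shown in Figure 3.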

Performance

Across eight empirical DNA datasets, PsiPartition and PsiPartitionFast fit the data substantially better than existing partitioning methods (mPartition[4], RatePartition[5]), giving much lower Bayesian Information Criterion (BIC) and corrected Akaike Information Criterion (AICc) values.

BIC and AICc comparison across partitioning methods
Figure 4: Improvement in model fit (lower ΔBIC and ΔAICc is better) of different partitioning methods on the empirical DNA datasets. PsiPartition (dark red) consistently achieves the best fit. Adapted from the paper.
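
For readers unfamiliar with these criteria, the sketch below shows their standard definitions. It assumes the sample size is the number of alignment sites, which is the usual convention in phylogenetics. The log-likelihoods and parameter counts in the example are made-up numbers for illustration, not results from the paper.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: k * ln(n) - 2 * lnL (lower is better)."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood


def aicc(log_likelihood, n_params, n_sites):
    """Corrected Akaike Information Criterion:
    2k - 2 lnL + 2k(k + 1) / (n - k - 1) (lower is better)."""
    if n_sites - n_params - 1 <= 0:
        raise ValueError("AICc is undefined when n_sites <= n_params + 1")
    aic = 2.0 * n_params - 2.0 * log_likelihood
    return aic + 2.0 * n_params * (n_params + 1) / (n_sites - n_params - 1)


# Hypothetical numbers: a partitioned model has more parameters,
# but may still win if the likelihood improves enough.
n_sites = 3000
unpartitioned = {"log_likelihood": -52000.0, "n_params": 50}
partitioned = {"log_likelihood": -51200.0, "n_params": 90}

for name, model in (("unpartitioned", unpartitioned), ("partitioned", partitioned)):
    print(
        f"{name:>13}: BIC = {bic(n_sites=n_sites, **model):.1f}, "
        f"AICc = {aicc(n_sites=n_sites, **model):.1f}"
    )
```

Both criteria penalize extra parameters, so a lower value means the gain in likelihood outweighs the cost of the additional partitions.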

More importantly, better model fit translates into more accurate trees. On simulated data with heterogeneous evolutionary rates, PsiPartition reconstructs the trees closest to the truth — the smallest Robinson–Foulds (RF) distance[6] — and its advantage grows as the number of loci increases. It also outperforms existing methods on empirical protein datasets.

Robinson-Foulds distance to the true tree
Figure 5: Average Robinson–Foulds distance between the reconstructed and true trees on simulated data (lower is more accurate). PsiPartition (red) gives the most accurate reconstructions across numbers of loci. Adapted from the paper.
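
The RF distance counts the splits (bipartitions of the taxa) that appear in one tree but not in the other. Below is a minimal sketch for trees written as nested tuples of leaf names. It compares unrooted topologies only and ignores branch lengths. The function names are illustrative, and a dedicated phylogenetics library would be the better choice for real analyses.

```python
def leaf_set(tree):
    """All leaf names below a node (a leaf is a string, a clade is a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of the tree, viewed as an unrooted topology."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed leaf used to name each split consistently
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        # Represent the split by the side that does not contain the anchor
        side = all_leaves - leaves if anchor in leaves else leaves
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return leaves

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("both trees must contain the same taxa")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


true_tree = (("A", "B"), ("C", ("D", "E")))
inferred_tree = (("A", "C"), ("B", ("D", "E")))
print(robinson_foulds(true_tree, true_tree))      # identical topologies
print(robinson_foulds(true_tree, inferred_tree))  # one split differs in each tree
```

A distance of 0 means the two topologies are identical. Each misplaced grouping adds to the distance, so smaller values in Figure 5 mean reconstructions closer to the true tree.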

How to Use PsiPartition

Step 1: Preparation

Before using PsiPartition, please prepare the following things:

  1. A sequence alignment in FASTA format. If you have a query sequence and want to infer its phylogenetic relationship to other sequences, search a database such as UniProt for homologous sequences and align them with software such as COBALT.
  2. Phylogenetic software. PsiPartition builds partitioned models but does not perform phylogenetic inference itself. We use IQ-TREE as the host software. Download IQ-TREE and unzip the files into a folder. You can test that it works by running:
    ./bin/iqtree2.exe -s example.phy
    IQ-TREE
    Figure 6: IQ-TREE phylogenetic software. The software is used to infer phylogenetic trees from sequence alignments.
  3. Python. PsiPartition is written in Python, so Python must be installed on your computer. You can download it from the official Python website.
  4. PsiPartition. Download the PsiPartition software, unzip it, go to its folder, and install the required packages:
    pip install -r requirements.txt
  5. A Weights & Biases account. PsiPartition uses Weights & Biases to log the optimization process. You need to sign up for an account and get your API key.
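
Before moving on, it can save time to check that the FASTA file from item 1 really is an alignment, that is, that all sequences have the same length. The following is a small stand-alone sketch and not part of PsiPartition. The helper names `read_fasta` and `check_alignment` are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping sequence names to sequences."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            name = header[0]  # keep the name up to the first whitespace
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def check_alignment(records):
    """Return (number of sequences, alignment length) or raise ValueError."""
    if not records:
        raise ValueError("no sequences found")
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences have different lengths: {sorted(lengths)}")
    return len(records), lengths.pop()


example = """>taxon_A
ACGT
ACGT
>taxon_B some description
AC-TACGA
>taxon_C
ACGTAC-T
"""
n_sequences, n_sites = check_alignment(read_fasta(example))
print(f"{n_sequences} sequences, {n_sites} aligned sites")
```

To check your own file, read it with `open(path).read()` and pass the text to `read_fasta`.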

Step 2: Run PsiPartition

Once everything above is in place, run PsiPartition with the command below:

python PsiPartition_wandb.py --msa MSA_File --format fasta --alphabet dna --max_partitions 5 --n_iter 100

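The arguments are:

  - --msa: path to the multiple sequence alignment file (MSA_File in the command above).
  - --format: format of the alignment file (fasta in this example).
  - --alphabet: alphabet of the sequences (dna in this example).
  - --max_partitions: maximum number of partitions (5 in this example).
  - --n_iter: number of optimization iterations (100 in this example).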

PsiPartition output
Figure 7: Example output of PsiPartition.

The *.iqtree file contains the reconstructed phylogenetic tree. You can visualize the tree using software such as iTOL. In addition, the file *.parts contains the optimized partitioning scheme. You can use this file to analyze the data with partitioned models in IQ-TREE:

./bin/iqtree2.exe -s example.phy -spp example.parts

References

Please cite the following when referencing this method:

Please contact shijie.xu@ees.hokudai.ac.jp for any questions.

Changelogs