Phylogenetic Analysis Programs
v3.1

Methods






DATA ACQUISITION

Input

To run PhyleasProg you must provide two types of information:
  • A list of Ensembl protein IDs, separated by commas, spaces or newline characters. Check that your IDs are Ensembl protein IDs by verifying that (i) each ID starts with "ENS" and (ii) the last letter of the ID is a "P" (e.g. ENSMUSP00000099398). You can find the ID of your protein on its protein page on the Ensembl web site, www.ensembl.org. This list of IDs forms the basis of your PhyleasProg run, in which each ID is treated independently of the others.
  • A list of species of interest. You choose your species from two lists: (1) fully sequenced vertebrates and (2) not fully sequenced vertebrates. You can consult your list of IDs and the selected species at any time during the computation by clicking on "Job summary page".
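The two ID checks above can be applied to a list before submission. The following is a minimal sketch, not PhyleasProg's own validation code: the function name `parse_ensembl_protein_ids` is hypothetical, and it only tests the two rules stated above (the ID starts with "ENS" and its last letter is a "P").

```python
import re


def parse_ensembl_protein_ids(raw: str) -> tuple[list[str], list[str]]:
    """Split a raw ID list and separate plausible Ensembl protein IDs from the rest.

    Hypothetical helper: it applies only the two rules given in the text.
    """
    # IDs may be separated by commas, spaces or newline characters.
    tokens = [token for token in re.split(r"[,\s]+", raw) if token]
    valid, rejected = [], []
    for token in tokens:
        letters = [char for char in token if char.isalpha()]
        # Rule (i): starts with "ENS". Rule (ii): the last letter is "P".
        if token.startswith("ENS") and letters and letters[-1] == "P":
            valid.append(token)
        else:
            rejected.append(token)
    return valid, rejected


valid, rejected = parse_ensembl_protein_ids(
    "ENSMUSP00000099398, ENSG00000139618\nENSP00000369497 foo"
)
print(valid)     # IDs that pass both checks
print(rejected)  # IDs to correct before submitting
```

Passing these checks does not guarantee that an ID exists in Ensembl; it only filters out obvious mistakes such as gene or transcript IDs.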

Interrogation of Ensembl database

Once protein IDs and a list of species are provided, PhyleasProg queries the Ensembl database to retrieve the following data for each ID of your job:
  • The ID of the corresponding gene.
  • The sequence of the protein.
  • A list of protein sequences of paralogous genes.
  • A list of protein sequences of orthologous genes.
  • For each orthologous gene, a list of its paralogous genes and their sequences.


RECONSTRUCTION OF PHYLOGENETIC TREES

PhyleasProg reconstructs two types of phylogenetic tree for each submitted ID:
  • A phylogenetic tree of the orthologues of your gene.
  • A phylogenetic tree of the paralogues of your gene, with a separate tree for the paralogues of each species.

Multiple sequence alignments

Based on the protein sequences retrieved from the Ensembl database, two kinds of multiple sequence alignment are performed with Muscle or Prank, depending on the option chosen for the PhyleasProg computation (fast or fine, respectively):
  • One alignment for orthologous genes.
  • One alignment for each set of paralogous genes, if the computation is also performed on paralogues.
The protein multiple sequence alignments are then converted into codon alignments by PAL2NAL. When the fine computation is chosen, Prank performs the multiple sequence alignment and the resulting alignments are edited by Gblocks with strict parameters. If the length of the "clean alignment" is less than 30% of the length of the "raw alignment", the shortest sequence is removed from the dataset and a new alignment is performed. If the length of the "clean alignment" is between 30% and 50% of the length of the "raw alignment", the "raw alignment" is edited again by Gblocks with relaxed parameters.
Note that Prank uses a phylogenetic tree to perform the multiple sequence alignment, so if that tree is incorrect, the resulting multiple sequence alignment will also be incorrect. We recommend that PhyleasProg users carefully examine the displayed phylogenetic tree and multiple sequence alignment before interpreting PhyleasProg results.
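The alignment editing rule of the fine computation can be summarized as a small decision function. This is an illustrative sketch only, not PhyleasProg's source code: the function name and the returned labels are hypothetical, and treating exactly 30% and exactly 50% as part of the "between 30% and 50%" case is an assumption, since the text does not say how the boundaries are handled.

```python
def next_alignment_step(raw_length: int, clean_length: int) -> str:
    """Decide what to do after a strict Gblocks run (hypothetical helper).

    raw_length:   number of columns in the "raw alignment"
    clean_length: number of columns kept in the "clean alignment"
    """
    if raw_length <= 0:
        raise ValueError("raw_length must be positive")
    ratio = clean_length / raw_length
    if ratio < 0.30:
        # Too little is left: remove the shortest sequence and realign.
        return "drop_shortest_and_realign"
    if ratio <= 0.50:  # assumption: both boundaries fall in this case
        # Edit the raw alignment again with relaxed Gblocks parameters.
        return "rerun_gblocks_relaxed"
    # More than half of the columns survive: keep the strict result.
    return "keep_strict_result"


print(next_alignment_step(1000, 250))
print(next_alignment_step(1000, 400))
print(next_alignment_step(1000, 800))
```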

Phylogenetic reconstruction

The "clean alignment" resulting from the previous step is used by TreeBest to reconstruct the phylogenetic tree.

Visualization

Archaeopteryx is the application used for the visualization of the phylogenetic trees.


POSITIVE SELECTION CALCULATIONS

To calculate positive selection on your datasets, PhyleasProg uses the program codeml from the PAML package. This program estimates the ratio of non-synonymous to synonymous substitution rates (dN/dS), called ω, which is a measure of selective pressure. Values of ω < 1, = 1, and > 1 indicate purifying selection, neutral evolution, and positive selection, respectively. Two distinct categories of codon substitution models are used: site models and branch-site models. For both types of analysis, two models are compared: one that allows positive selection and one that does not. For each model, the log likelihood (lnL) value is retrieved (lnL1 for the model allowing positive selection, lnL0 for the other) and a Likelihood Ratio Test (LRT) is computed (LRT = 2 × (lnL1 − lnL0)) to assess the significance of the result. The LRT statistic is assumed to follow a χ² distribution, from which its p-value is obtained. If the LRT is significant for the comparison, PhyleasProg reports the sites under positive selection detected by Bayes Empirical Bayes (BEB) with posterior probabilities greater than 95% or 99%, and the sites under purifying selection.
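The LRT computation can be reproduced by hand from the two lnL values reported by codeml. The sketch below is illustrative and is not PhyleasProg's code: the function names are hypothetical, the lnL values in the usage example are made up, and the degrees of freedom (which depend on the pair of models compared) must be supplied by the caller. Only the closed-form χ² tail probabilities for 1 and 2 degrees of freedom are implemented, so that the example needs nothing beyond the standard library.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square distribution (df = 1 or 2 only)."""
    if x <= 0:
        return 1.0
    if df == 1:
        # P(Z^2 > x) for a standard normal Z
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        # Chi-square with 2 df is an exponential distribution with mean 2
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


def likelihood_ratio_test(lnl1: float, lnl0: float, df: int) -> tuple[float, float]:
    """Return (LRT, p-value) with LRT = 2 * (lnL1 - lnL0)."""
    lrt = 2.0 * (lnl1 - lnl0)
    return lrt, chi2_sf(lrt, df)


# Made-up lnL values for a model allowing positive selection (lnL1)
# and for the corresponding model that does not (lnL0).
lrt, p_value = likelihood_ratio_test(lnl1=-1000.0, lnl0=-1003.0, df=2)
print(f"LRT = {lrt:.2f}, p = {p_value:.4f}")
```

A negative LRT (lnL1 lower than lnL0) usually points to an optimization problem in one of the two runs; this sketch simply returns a p-value of 1 in that case.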

The "site models"

These models allow the ω ratio to vary among sites (codons). Five models and three comparisons are used in PhyleasProg:
  • M1a (0 < ω0 < 1 and ω1 = 1) vs M2a (0 < ω0 < 1, ω1 = 1 and ω2 > 1) (Wong et al. 2004; Yang et al. 2005).
  • M7 (0 < ω < 1) vs M8 (0 < ω < 1 and ωs > 1) (Yang et al. 2000).
  • M8 vs M8a (0 < ω < 1 and ωs = 1) (Swanson et al. 2003).
The two models M2a and M8 allow positive selection, whereas the M1a, M7 and M8a models do not.
Real data analyses and computer simulations (Anisimova, Bielawski, and Yang 2001, 2002; Anisimova, Nielsen, and Yang 2003; Wong et al. 2004) suggest that the two pairs of site models M1a-M2a and M7-M8 are particularly effective. Note that the M1a-M2a comparison appears to be more robust (but less powerful) than the M7-M8 comparison. The M8a model is a modified version of the M8 model in which ωs = 1, whereas ωs > 1 in M8. The M8-M8a comparison may in some cases have more power than the M7-M8 comparison because of the reduction in the degrees of freedom (df = 1 and df = 2, respectively). However, if there is a category of positively selected sites whose ω is only slightly larger than one, the M8-M8a comparison may be less powerful than the M7-M8 comparison (Swanson et al. 2003).
For more details, see the PAML documentation at http://abacus.gene.ucl.ac.uk/software/paml.html.

In PhyleasProg, only comparisons with a significant LRT are presented. If more than one comparison has a significant LRT, we suggest that you explore all the results pages to determine whether the positively selected sites are the same between models. If the same amino acids are detected by more than one comparison, it is more likely that they are under positive selection.
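Checking which sites are shared between comparisons can be done by hand from the results pages, or with a few lines of code once the site positions have been copied out. The sketch below is a hypothetical helper, not a PhyleasProg feature, and the site positions in the usage example are invented for illustration.

```python
def sites_with_multiple_support(
    sites_per_comparison: dict[str, set[int]],
) -> dict[int, list[str]]:
    """Map each site found by at least two comparisons to those comparisons."""
    support: dict[int, list[str]] = {}
    for comparison, sites in sites_per_comparison.items():
        for site in sites:
            support.setdefault(site, []).append(comparison)
    # Keep only sites detected by more than one comparison, in sequence order.
    return {
        site: sorted(names)
        for site, names in sorted(support.items())
        if len(names) > 1
    }


# Invented example: positively selected sites reported by each comparison.
results = {
    "M1a-M2a": {12, 45},
    "M7-M8": {12, 45, 88},
    "M8-M8a": {45},
}
for site, comparisons in sites_with_multiple_support(results).items():
    print(site, comparisons)
```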

The "branch-site models"

These models allow the ω ratio to vary both among sites in the protein and across branches of the tree, and aim to detect positive selection affecting a few sites along particular lineages (foreground branches). In PhyleasProg, every branch of the tree is tested as a foreground branch for positive selection. Two models are used, one called Alternative and one called Null. In the Alternative model, three classes of sites are allowed on the foreground branch: ω0 (dN/dS < 1), ω1 (dN/dS = 1) and ω2 (dN/dS ≥ 1). In the Null model, ω2 is fixed to 1.

Visualization

The results of the positive selection calculations are displayed on both the 1D and the 3D structure on the same results page. For both types of representation, a color scale is used to distinguish the different values of ω at each site. The scale from light to dark green represents purifying selection, while orange and red represent positive selection with posterior probabilities greater than 95% and 99%, respectively.
To model the 3D structure, a BLAST search is performed to find a similar structure in the PDB database, which is then used as a template to build a model with Modeller. If a PDB sequence matches the submitted protein well, the evolutionary results are displayed directly on the modeled structure.


SYNTENY EXPLORATION

In parallel with these calculation steps, PhyleasProg provides a link to the genome browser Genomicus. Genomicus lets you navigate genomes and explore the synteny of your gene of interest.