CBU - Computational Biology Unit
 


  • Warning: this service is no longer being developed and is provided as-is to the community.

  • Please note that results are not guaranteed if a task runs for more than 2 hours.

Scientific questions: David Liberles' group

Technical comments: services -at- cbu.uib.no

The Ka/Ks service

The ratio of nonsynonymous (Ka) to synonymous (Ks) nucleotide substitution rates is an indicator of selective pressures on genes, and can be used to identify pairwise combinations of genes, or branches of gene phylogenetic trees, where the encoded proteins may have changed function. A ratio significantly greater than 1 indicates positive selective pressure. A ratio around 1 indicates either neutral evolution at the protein level or an averaging of sites under positive and negative selective pressures. A ratio less than 1 indicates pressure to conserve the protein sequence. The average Ka/Ks ratio between human and rodent is ~0.2 (see, for example, Wolfe and Sharp (1993) J. Mol. Evol. 37:441-456). For a more detailed description of the calculations done here, see Liberles (2001) Mol. Biol. Evol. 18:2040-2047, Siltberg and Liberles (2002) J. Evol. Biol. 15:588-594, and papers referenced therein.
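To make the distinction between synonymous and nonsynonymous changes concrete, the following minimal Python sketch counts both kinds of codon difference between two codon-aligned sequences, using the standard genetic code. It is only an illustration: it is not the calculation performed by this service, which also normalises by the number of synonymous and nonsynonymous sites and applies the corrections described in the papers cited above. The function name `count_differences` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_differences(a, b):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Codons containing gaps or non-ACGT characters are skipped.
    """
    syn = nonsyn = 0
    for i in range(0, min(len(a), len(b)) - 2, 3):
        ca, cb = a[i:i + 3].upper(), b[i:i + 3].upper()
        if ca == cb or ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue
        if CODON_TABLE[ca] == CODON_TABLE[cb]:
            syn += 1      # same amino acid: synonymous change
        else:
            nonsyn += 1   # different amino acid: nonsynonymous change
    return syn, nonsyn
```

For example, `count_differences("ATGAAACTT", "ATGAAGCGT")` finds one synonymous change (AAA to AAG, both lysine) and one nonsynonymous change (CTT to CGT, leucine to arginine).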

Sequence alternatives

The sequences should be in FASTA format. They can either be pasted in or uploaded from a file of FASTA-formatted sequences. The sequences can be either unaligned or prealigned, padded with '-' for gaps. Since Ka/Ks calculations work on codons, gaps must be inserted in groups of three to keep codons aligned. If the number of nucleotides is not divisible by three, the last one or two nucleotides are ignored in the calculations. Codons that contain wildcard nucleotides (like n, r or y) or non-nucleotide characters are replaced by gaps before the Ka/Ks calculations.
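The clean-up rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the service's own code: the function name `clean_codons` is hypothetical, and only A, C, G and T are treated as valid nucleotides.

```python
VALID = set("ACGT")  # assumption: anything else counts as a wildcard or non-nucleotide


def clean_codons(seq):
    """Drop trailing nucleotides and blank out codons that cannot be used."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore the last one or two nucleotides
    out = []
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        if set(codon) <= VALID:
            out.append(codon)
        else:
            # Wildcards, other characters and existing gaps all become a gap codon.
            out.append("---")
    return "".join(out)
```

With this sketch, `clean_codons("ATGNNTAAAGG")` gives `ATG---AAA`: the trailing `GG` is dropped and the codon containing `N` is replaced by a gap.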

Phylogenetic tree options

The Ka/Ks service uses binary rooted phylogenetic trees in Newick format (also known as New Hampshire format) or Schreiber format. Newick format is described at http://evolution.genetics.washington.edu/phylip/newicktree.html. We don't support Nexus format directly, but a Newick tree is embedded in this format, so you can always cut and paste it from a Nexus file. The algorithm assumes that all input trees are rooted and binary.

Calculate the multiple sequence alignment and the phylogenetic tree from the sequences

This is the simplest (and least accurate) method. The only required input is unaligned sequences. The Ka/Ks tool translates the DNA sequences to protein, calculates the alignment, and then transforms it back to DNA so that it is aligned on codon boundaries, as required for the Ka/Ks calculation. This alignment is then used to calculate a phylogenetic tree with a least squares distance method, using Jukes-Cantor distances on the DNA.
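The step that transforms a protein alignment back to DNA can be sketched as below. This is a minimal Python illustration, assuming the protein alignment has already been computed by an external aligner; the function name `thread_codons` is hypothetical and is not part of the service.

```python
def thread_codons(aligned_protein, dna):
    """Map an aligned protein back onto its DNA, one codon per residue.

    Each '-' in the protein alignment becomes a three-nucleotide gap, so
    the resulting DNA alignment has gaps only on codon boundaries.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    residues = len(aligned_protein) - aligned_protein.count("-")
    if residues != len(codons):
        raise ValueError("protein and DNA lengths do not correspond")
    out = []
    k = 0  # index of the next unused codon
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")
        else:
            out.append(codons[k])
            k += 1
    return "".join(out)
```

For example, threading `ATGAAATGG` onto the aligned protein `M-KW` gives `ATG---AAATGG`.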

Calculate phylogenetic tree from prealigned sequences

It is usually desirable to check the alignment before sending it to the Ka/Ks service. This option requires the user to input FASTA-formatted sequences with inserted gaps. The gaps must be on codon boundaries (i.e., in groups of three). The tree is calculated with a least squares distance method, using Jukes-Cantor distances on the DNA.
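For reference, the Jukes-Cantor distance between two aligned DNA sequences is d = -(3/4) ln(1 - (4/3) p), where p is the proportion of differing sites. A minimal Python sketch follows; skipping columns that contain gaps or ambiguous characters is an assumption of this example, and the service's exact handling of such columns is not described here.

```python
import math


def jukes_cantor(a, b):
    """Jukes-Cantor distance between two aligned DNA sequences."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]  # skip gaps and wildcards
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(x != y for x, y in pairs) / len(pairs)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Two sequences differing at 1 site in 4 (p = 0.25) have a distance of about 0.304, slightly larger than p because the correction accounts for multiple substitutions at the same site.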

Paste in the tree

Here, a tree must be supplied to the service by pasting it into the text field below the checkbox. This option requires the user to input FASTA-formatted sequences with inserted gaps. The gaps must be on codon boundaries, as in the option above.

Upload the tree from file

Same as the option above, but the tree is uploaded from a file instead of being pasted.

Tree examples

Given the phylogenetic tree of three sequences (seq1, seq2 and seq3) below
     |
    / \2
  1/   \
  /\    \
 /  \    \
seq1 seq2 seq3

The Newick-formatted tree describes it as a set of nested parentheses, so the tree looks like:
((seq1,seq2),seq3);
The Schreiber tree is a 'flattened' array representation, where each element represents an internal node in the tree. As a rule, the internal nodes must be numbered so that nodes closer to the root have higher numbers than their child nodes. The node numbers of the internal nodes are printed on the tree above. Using these numbers, the Schreiber tree looks as follows:

n:=[[1,2],[3,[1]]]:

The first element, [1,2], means that leaf node number 1 and leaf node number 2 are joined. Leaf numbers refer to the order of the sequences, so the FASTA-formatted sequences in the sequence input file must appear in the right order. The second element, [3,[1]], means that the third sequence is joined with the first internal node. The brackets around the number one indicate that it is an internal node rather than a leaf node.
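The relationship between the two formats can be sketched in Python as follows. The Schreiber array is written here as a nested Python list, and the sketch assumes that the last element is the root, which follows from the numbering rule described above. The function name `schreiber_to_newick` is hypothetical.

```python
def schreiber_to_newick(tree, names):
    """Convert a Schreiber-style array to a Newick string.

    tree  -- list of [left, right] pairs, one per internal node; an int is
             a 1-based leaf number, and a one-element list [k] refers to
             internal node k (also 1-based).
    names -- sequence names in input-file order.
    """
    def render(ref):
        if isinstance(ref, list):   # [k] marks an internal node
            return node(ref[0])
        return names[ref - 1]       # plain number marks a leaf

    def node(k):
        left, right = tree[k - 1]
        return "(" + render(left) + "," + render(right) + ")"

    # Assumption: the highest-numbered internal node is the root.
    return node(len(tree)) + ";"
```

For the three-sequence tree, `schreiber_to_newick([[1, 2], [3, [1]]], ["seq1", "seq2", "seq3"])` gives `(seq3,(seq1,seq2));`, which is the same tree as `((seq1,seq2),seq3);` with the two children of the root written in the opposite order.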

Mapping from tree leaf nodes to sequences in Newick format

In the Newick format, leaf nodes are mapped to sequences by matching the node name against the sequence names: a node matches a sequence if the node name is found as a substring of the sequence name. This means that the node name 'U92715.1' will match the sequence named 'gi:3237305|ga:U92715.1|Homo sapiens|breast cancer antiestrogen resistance 3 protein', because the sequence name contains the node name. The downside of this scheme is that the node '1' will match both the sequence named '1' and the one named '10'.
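The ambiguity can be checked before submission with a short Python sketch like the one below, which lists every sequence name that contains each node name. The function name `match_leaves` is hypothetical; the matching rule is the substring test described above.

```python
def match_leaves(node_names, seq_names):
    """Return, for each node name, all sequence names that contain it."""
    return {n: [s for s in seq_names if n in s] for n in node_names}


def ambiguous_nodes(node_names, seq_names):
    """Node names that do not match exactly one sequence."""
    matches = match_leaves(node_names, seq_names)
    return [n for n in node_names if len(matches[n]) != 1]
```

Node names that are reported by `ambiguous_nodes` should be made more specific, for example by using full accession numbers rather than short numeric labels.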

Calculate MSA and tree from the sequences

In this case the input sequences are unaligned. The service calculates the multiple sequence alignment using ClustalW, and then calculates the tree from the alignment using a least squares method, with DNA distances calculated using the Jukes-Cantor method.

Calculate tree from prealigned sequences

If this option is chosen, the sequences are assumed to be prealigned, with gaps represented by a dash '-' character. The sequences, including gaps, must all be of the same length, and codons must always be aligned. The tree is calculated from the alignment using a least squares method, with DNA distances calculated using the Jukes-Cantor method.
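A prealigned input can be checked against these requirements before submission. The Python sketch below is one possible check, not the service's own validation; the function name `check_codon_alignment` and the wording of the messages are hypothetical.

```python
def check_codon_alignment(seqs):
    """Return a list of problems found in a dict of name -> aligned DNA."""
    problems = []
    if len({len(s) for s in seqs.values()}) > 1:
        problems.append("sequences differ in length")
    for name, s in seqs.items():
        if len(s) % 3:
            problems.append(name + ": length is not a multiple of three")
        for i in range(0, len(s) - len(s) % 3, 3):
            codon = s[i:i + 3]
            # A gap must cover a whole codon, never part of one.
            if "-" in codon and codon != "---":
                problems.append(name + ": partial gap in codon " + str(i // 3 + 1))
    return problems
```

An empty list means that all sequences have the same length and that every gap covers a whole codon.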

Paste in tree

If this option is chosen, the sequences are assumed to be prealigned, and the tree (in Schreiber format) can be pasted into the text field below.

Upload tree from file

If this option is chosen, the sequences are assumed to be prealigned, and the tree (in Schreiber format) can be uploaded from a file.

Codon bias

The codon bias option corrects for a fixed codon bias within the dataset (but not for directional selection between different sequences in the data). Tables of codon bias for different species can be found at http://www.kazusa.or.jp/codon/.

To use this option, insert a Darwin format matrix where the position of a codon corresponds to the Darwin codon numbers (see GenCodeToInt). Values should reflect the representation of a codon encoding a specific amino acid, where the neutral expectation is a value of 1 in each position.

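To calculate the values, divide the usage of each codon by the mean usage of the codons encoding the same amino acid. For example, a 4-fold degenerate amino acid whose codons have usages of 23, 34, 25, and 108 (mean 47.5) would obtain the values 0.48, 0.72, 0.53, and 2.27.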
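The arithmetic in the example above amounts to dividing each usage by the mean usage across the synonymous codons, so that unbiased usage gives 1 in every position. A minimal Python sketch, with the hypothetical function name `relative_usage`:

```python
def relative_usage(counts):
    """Scale codon usages so that the neutral expectation is 1.

    counts -- usages of all codons encoding one amino acid.
    """
    mean = sum(counts) / len(counts)
    return [round(c / mean, 2) for c in counts]
```

`relative_usage([23, 34, 25, 108])` returns `[0.48, 0.72, 0.53, 2.27]`, matching the values quoted above.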

For example,


codonbias := [0.76,1.29,1.28,0.79,0.89,1.73,0.58,0.85,1.09,1.41,1.17,0.65,0.39,1.78,1,0.95,0.43,
              1.22,1.5, 0.63,0.85,1.37,0.53,0.94,0.56,1.29,1.2, 0.43,0.34,1.39,2.82,0.66,0.73,1.21,
              1.23,0.8,0.73,1.81,0.46,0.97,0.88,1.52,0.99,0.57,0.34,1.18,2.13,0.59,1,1.37,1,0.77,
              0.6,1.37,0.36,0.89,1,1.34,1,0.86,0.31,1.33,0.69,0.88]:
corresponds to human codon usage. The codon bias format is a Darwin array with 64 elements assigned to the variable codonbias, as in the example above.
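Before pasting or uploading a matrix, it can be useful to confirm that it really contains 64 values. The Python sketch below reads an assignment of the form shown above and checks the element count. It is a convenience check only and makes no attempt to parse Darwin syntax in general; the function name `parse_codonbias` is hypothetical.

```python
import re


def parse_codonbias(text):
    """Extract the 64 values from a 'codonbias := [...]:' assignment."""
    m = re.search(r"codonbias\s*:=\s*\[(.*?)\]", text, re.DOTALL)
    if m is None:
        raise ValueError("no codonbias assignment found")
    # float() tolerates the spaces and line breaks around each value.
    values = [float(v) for v in m.group(1).split(",")]
    if len(values) != 64:
        raise ValueError("expected 64 values, found " + str(len(values)))
    return values
```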

None

No codon bias is used in the algorithm

Paste

Paste in the codon bias matrix in the textfield below

Upload

Upload the codon bias matrix from a local file.

GC content

The GC content option allows for correction of a fixed nonrandom usage of GC vs. AT nucleotide composition as a selective pressure. Again, directional selection for GC content is not corrected for.

Determine

The GC content is determined by averaging over all of the sequences.
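As an illustration, the average GC content of a set of sequences could be computed as in the Python sketch below. Whether the service averages per-sequence values, as done here, or pools all nucleotides is not stated above, so this choice is an assumption, as are the function names.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous nucleotides of one sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]  # gaps and wildcards ignored
    if not bases:
        raise ValueError("sequence contains no nucleotides")
    return sum(b in "GC" for b in bases) / len(bases)


def mean_gc(seqs):
    """Mean of the per-sequence GC fractions (assumed averaging scheme)."""
    return sum(gc_fraction(s) for s in seqs) / len(seqs)
```

For example, `mean_gc(["GGCC", "AATT"])` is 0.5.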

Ignore

GC content is ignored.

Choose

User-supplied. A percentage of GC codons to be used by the algorithm must be supplied.

Windowing options

Because a subset of sites within a gene may be under positive selective pressure while other sites are under conservative selective pressure, windowing allows one to identify a section of a sequence where the forces of positive selection may be operating (e.g., a binding surface). A method based upon windows in 3D protein structures will be available soon.

None

No windows are used and the whole sequence is averaged together.

Primary

Primary sequence windowing looks at contiguous blocks in the primary sequence. This method is based upon the approach of Endo et al. (1996) Mol. Biol. Evol. 13:685-690. An improved approach that is not yet available from this site has been described by Fares et al. (2002) J. Mol. Evol. 55:509-521.
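The idea of contiguous windows can be sketched in Python as below: the aligned sequence is cut into codons, and overlapping blocks of consecutive codons are produced so that a statistic such as Ka/Ks can be evaluated per block. The window size and step are hypothetical parameters chosen for illustration, since the values used by the service are not stated here, and the sketch does not reproduce the method of Endo et al.

```python
def codon_windows(aligned, size, step=1):
    """Yield (start_codon_index, window_sequence) for contiguous codon blocks.

    size -- window length in codons (illustrative parameter)
    step -- number of codons to advance between windows (illustrative)
    """
    codons = [aligned[i:i + 3]
              for i in range(0, len(aligned) - len(aligned) % 3, 3)]
    for start in range(0, len(codons) - size + 1, step):
        yield start, "".join(codons[start:start + size])
```

For example, `list(codon_windows("ATGAAACCCGGG", 2))` gives the three windows `ATGAAA`, `AAACCC` and `CCCGGG`, starting at codons 0, 1 and 2.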

Variable

Variable windowing looks at the subset of residues that are candidates for positive selection under the model of Miyamoto and Fitch (see Miyamoto and Fitch (1995) Mol. Biol. Evol. 12:503-513). Ka/Ks is calculated using only the potentially variable sites. This method is described more fully in Siltberg and Liberles (2002) J. Evol. Biol. 15:588-594.

Weighting

Branch length weighting of substitutions in the parsimony approach is possible. Branch lengths are measured as NED distances (see Peltier et al. (2000) J. Exp. Zool. 288:165-174) based upon DNA (measured from synonymous pyrimidine transitions in 2-fold degenerate codons) or as PAM distances measured from proteins using a UPGMA-based algorithm. NED branch lengths can also be estimated using an iterative approach based upon adjustment using a joint reconstruction-like algorithm (NEDexp).

Tree method

The available methods are parsimony reconstruction based upon DNA sequence only, parsimony reconstruction based upon DNA+protein, and a pairwise method with no reconstruction. A maximum likelihood based reconstruction is currently available (and an improved version of this will be available soon). Also available soon will be a UPGMA extraction of Ka and Ks values converted to ratios from pairwise sequence comparison. See Liberles (2001) Mol. Biol. Evol. 18:2040-2047, Pupko et al. (2000) Mol. Biol. Evol. 17:890-896, and Zhang et al. (1998) Proc. Natl. Acad. Sci., USA 95:3708-3713.

Submatrix

Several different substitution matrices are available for use. These include the discrete treatment of the Grantham matrix (Grantham (1974) Science 185:862-864) originally used by Li et al. (1985) Mol. Biol. Evol. 2:150-174, a continuous treatment of the same matrix, the Taylor-Jones matrix (Taylor and Jones (1993) J. Theor. Biol. 164:65-83), and a matrix derived from Zhang (2000) J. Mol. Evol. 50:56-68. In most cases all treatments seem to give similar results, except for the continuous treatment of the Grantham matrix (again, see Liberles (2001) Mol. Biol. Evol. 18:2040-2047).

Li rate

The Li rate options correspond to the evolutionary model as originally described in Li et al. (1985) Mol. Biol. Evol. 2:150-174. A moderate rate seems to be appropriate for most genes.

This page is maintained by webmaster@bccs.uib.no. Last updated: Tuesday 12 February, 2008