SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins


Algorithm in detail



Version 1.0

This is the first release of the program. If you observe a bug or suspicious behavior, or get a nonsensical result, please send us a note containing the query alignment and the parameters you used. Thank you.


Consider a multiple protein sequence alignment. The proteins are divided into N specificity groups, numbered i = 1, ..., N. The goal is to identify columns (positions) of the alignment in which the amino acid distribution is closely associated with the grouping by specificity. This association in column p of the alignment is measured by the mutual information

$$I_p = \sum_{i=1}^{N} \sum_{\alpha} f_p(\alpha, i) \log \frac{f_p(\alpha, i)}{f_p(\alpha)\, f(i)},$$

where $\alpha$ is a residue type, $f_p(\alpha, i)$ is the ratio of the number of occurrences of residue $\alpha$ in group i at position p to the length of the whole alignment column, $f_p(\alpha)$ is the frequency of residue $\alpha$ in the whole alignment column, and $f(i)$ is the fraction of proteins belonging to group i. The mutual information reflects the statistical association between the two discrete random variables $\alpha$ and i.
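As an illustration of the definition above, the mutual information of a single column can be computed directly from counts. This is a minimal sketch: the function name `mutual_information`, the use of the natural logarithm, and the absence of any gap handling are assumptions made for the example, not details taken from SDPpred.

```python
import math
from collections import Counter

def mutual_information(column, groups):
    """Mutual information between residue type and group label in one column.

    column: residues at one alignment position, one per sequence
    groups: group label of each sequence, in the same order
    """
    total = len(column)
    joint = Counter(zip(column, groups))   # counts of (residue, group) pairs
    residue_counts = Counter(column)       # counts of each residue in the column
    group_counts = Counter(groups)         # number of sequences in each group
    info = 0.0
    for (residue, group), count in joint.items():
        f_joint = count / total
        f_residue = residue_counts[residue] / total
        f_group = group_counts[group] / total
        info += f_joint * math.log(f_joint / (f_residue * f_group))
    return info

groups = [1, 1, 1, 2, 2, 2]
perfect = mutual_information("AAAGGG", groups)    # residue fully determined by group
conserved = mutual_information("AAAAAA", groups)  # fully conserved column
```

For two equally sized groups, a column that separates the groups perfectly reaches log 2, while a fully conserved column gives zero, since it carries no information about the grouping.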

To address the facts that the frequencies are calculated from a small sample and that substitutions to amino acids with similar physical properties should be penalized only weakly, the observed amino acid frequencies are modified. Instead of using the raw frequencies $f(\alpha, i) = n(\alpha, i) / n(i)$, where $n(\alpha, i)$ is the number of occurrences of residue $\alpha$ in group i and $n(i)$ is the size of group i (here i is a single group or the whole alignment), SDPpred uses smoothed frequencies $\tilde{f}(\alpha, i)$. They are computed using $P(\beta \to \alpha)$, the probability of amino acid substitution according to the matrix corresponding to the average identity in group i, and a smoothing parameter $\kappa$.
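The idea of substitution-aware smoothing can be sketched with a simple pseudocount scheme. The weighting used here (pseudocounts scaled by the smoothing parameter divided by the square root of the group size) and the normalization are assumptions chosen for illustration and are not necessarily the expression SDPpred uses; the three-letter alphabet and the `toy` substitution table are hypothetical.

```python
import math

def smoothed_frequencies(counts, substitution, kappa):
    """Pseudocount-smoothed residue frequencies for one group at one position.

    counts: dict residue -> observed count
    substitution: dict b -> dict a -> P(b -> a), each row summing to 1
    kappa: smoothing parameter
    """
    size = sum(counts.values())
    weight = kappa / math.sqrt(size)   # assumed weighting; shrinks as the group grows
    alphabet = list(substitution)
    smoothed = {}
    for a in alphabet:
        # Pseudocount for a: observed residues redistributed by substitution probability
        pseudo = sum(counts.get(b, 0) * substitution[b][a] for b in alphabet)
        smoothed[a] = (counts.get(a, 0) + weight * pseudo) / (size * (1 + weight))
    return smoothed

# Hypothetical substitution probabilities: I and L are similar, D is dissimilar
toy = {
    "I": {"I": 0.80, "L": 0.15, "D": 0.05},
    "L": {"I": 0.15, "L": 0.80, "D": 0.05},
    "D": {"I": 0.05, "L": 0.05, "D": 0.90},
}
freqs = smoothed_frequencies({"I": 4}, toy, kappa=0.5)
```

In this toy case a group containing only isoleucine assigns more smoothed probability to leucine than to aspartate, which is the intended effect: a substitution to a physically similar residue is penalized less.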

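To calculate the statistical significance of the obtained values of $I_p$, each column is shuffled, yielding the distribution of $I_p^{sh}$, the mutual information of the shuffled column. To offset the background similarity of proteins, which is higher within groups than between groups, we calculate the expected mutual information for column p as a linear function of the shuffled mean,

$$I_p^{exp} = a \langle I_p^{sh} \rangle + b,$$

where a and b do not depend on the position, i.e. are the same for every position of the alignment, and are chosen so that

$$\sum_{i=1}^{L} \left( I_i - a \langle I_i^{sh} \rangle - b \right)^2 \to \min.$$

Here L is the total length of the alignment and $I_i$ is the observed mutual information for the i-th column.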
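One way to read this step is as a column-wise permutation test followed by an ordinary least-squares fit of the observed values against the shuffled means. The sketch below follows that reading; the least-squares criterion, the number of shuffling rounds, and the names `shuffled_stats` and `fit_line` are assumptions, and the four example columns are made up.

```python
import math
import random
import statistics
from collections import Counter

def mutual_information(column, groups):
    """Mutual information between residue type and group label (natural log)."""
    total = len(column)
    joint = Counter(zip(column, groups))
    residues, sizes = Counter(column), Counter(groups)
    return sum(
        (c / total) * math.log(c * total / (residues[r] * sizes[g]))
        for (r, g), c in joint.items()
    )

def shuffled_stats(column, groups, rounds, rng):
    """Mean and standard deviation of mutual information over shuffled copies."""
    residues = list(column)
    values = []
    for _ in range(rounds):
        rng.shuffle(residues)              # permute residues, keep group labels fixed
        values.append(mutual_information(residues, groups))
    return statistics.mean(values), statistics.pstdev(values)

def fit_line(shuffled_means, observed):
    """Ordinary least squares for observed ~ a * shuffled_mean + b."""
    mean_x = statistics.mean(shuffled_means)
    mean_y = statistics.mean(observed)
    sxx = sum((x - mean_x) ** 2 for x in shuffled_means)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(shuffled_means, observed))
    a = sxy / sxx
    return a, mean_y - a * mean_x

rng = random.Random(0)                     # fixed seed so the run is repeatable
groups = [1, 1, 1, 1, 2, 2, 2, 2]
columns = ["AAAAGGGG", "AAGAGGAG", "LLIVLIVV", "KKKRRKRR"]
observed = [mutual_information(c, groups) for c in columns]
stats = [shuffled_stats(c, groups, 200, rng) for c in columns]
a, b = fit_line([mean for mean, _ in stats], observed)
expected = [a * mean + b for mean, _ in stats]
```

Because a and b are shared by all columns, the fit absorbs the overall tendency of observed values to exceed the shuffled ones, leaving the per-column deviation to be judged afterwards.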

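Then, Z-scores are calculated:

$$Z_p = \frac{I_p - I_p^{exp}}{\sigma(I_p^{sh})},$$

where $I_p^{exp}$ is the expected mutual information for column p and $\sigma(I_p^{sh})$ is the standard deviation of the mutual information over the shuffled copies of column p.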

A high Z-score indicates a position where the amino acid distribution is much more closely associated with the grouping by specificity than at an average position of the alignment, and which is therefore likely to be an SDP (specificity-determining position).
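A small numeric illustration of the scoring step, assuming each Z-score is the difference between observed and expected mutual information divided by the standard deviation over shuffled columns. The three sets of values are made up for the example.

```python
def z_scores(observed, expected, shuffled_sd):
    """Standardize observed mutual information against its expected value."""
    return [(o - e) / s for o, e, s in zip(observed, expected, shuffled_sd)]

observed = [0.69, 0.10, 0.12]      # hypothetical observed mutual information
expected = [0.15, 0.11, 0.10]      # hypothetical expected mutual information
shuffled_sd = [0.05, 0.05, 0.04]   # hypothetical spread over shuffled columns

z = z_scores(observed, expected, shuffled_sd)
# Positions ordered from the most to the least SDP-like
ranked = sorted(range(len(z)), key=lambda p: z[p], reverse=True)
```

Here the first position stands far above the other two, which stay close to zero, as expected for columns that behave like an average position.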

Given a series of Z-scores, one for every position of the multiple alignment, one needs to evaluate their significance in order to tell whether an observed Z-score is high enough to indicate an SDP. SDPpred uses an automated procedure for setting the threshold, based on the computation of the Bernoulli estimator. The observed Z-scores are sorted in decreasing order: $Z_1 \ge Z_2 \ge \dots \ge Z_n$. The threshold is defined as

$$k^* = \arg\min_k P(\text{at least } k \text{ observed scores } Z \ge Z_k) = \arg\min_k \left[ 1 - \sum_{i=0}^{k-1} \binom{n}{i} p^i q^{n-i} \right],$$

where n is the total number of considered positions, $p = P(Z \ge Z_k) = \frac{1}{\sqrt{2\pi}} \int_{Z_k}^{\infty} e^{-t^2/2}\, dt$, and $q = 1 - p$. The $k^*$ positions with the highest Z-scores are designated SDPs, as they are the least likely to constitute a tail of the Gaussian distribution and are therefore considered non-randomly generated positions.
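The threshold selection can be sketched as follows, assuming the cutoff is the number k that minimizes the probability of seeing at least k of n standard normal draws at or above the k-th highest Z-score. The function names are hypothetical, and the upper binomial tail is summed directly rather than as one minus the lower tail, which is mathematically equivalent but avoids floating-point cancellation when the probabilities are tiny.

```python
import math

def gaussian_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def prob_at_least(k, n, p):
    """P(at least k successes in n Bernoulli trials with success probability p)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )

def bernoulli_cutoff(z_scores):
    """Return (k, top-k scores) for the k minimizing the tail probability."""
    ordered = sorted(z_scores, reverse=True)
    n = len(ordered)
    best_k, best_prob = 0, float("inf")
    for k in range(1, n + 1):
        prob = prob_at_least(k, n, gaussian_tail(ordered[k - 1]))
        if prob < best_prob:               # keep the first minimum on ties
            best_k, best_prob = k, prob
    return best_k, ordered[:best_k]

# Made-up Z-scores: three clear outliers followed by ordinary positions
scores = [9.1, 8.4, 7.9, 1.2, 0.8, 0.3, -0.1, -0.5, -0.9, -1.4]
k, selected = bernoulli_cutoff(scores)
```

With these values the probability keeps dropping while the three outliers are added and jumps by many orders of magnitude at the fourth score, so the cutoff falls after the third position.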
