SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins
Algorithm in detail
The input is a multiple protein sequence alignment. The proteins are divided
into N specificity groups, numbered i=1,...,N. The goal is to identify columns
(positions) in the alignment in which the amino acid distribution is closely
associated with the grouping by specificity. This association in
column p of the alignment is measured by the mutual information Ip.
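As a rough illustration, the mutual information of a single column can be computed from plain (unsmoothed) counts as sketched below. This uses the standard definition of mutual information between the residue and the group label; the function name `column_mutual_information` and the exact normalization are assumptions made for this example, not details taken from SDPpred itself.

```python
from collections import Counter
from math import log


def column_mutual_information(column, groups):
    """Mutual information between residues in one column and group labels.

    column: residues of one alignment column, one per sequence.
    groups: specificity group label of each sequence, in the same order.
    """
    if len(column) != len(groups):
        raise ValueError("column and groups must have the same length")
    total = len(column)
    joint = Counter(zip(column, groups))   # counts of (residue, group) pairs
    residue_counts = Counter(column)       # counts of each residue in the column
    group_counts = Counter(groups)         # number of sequences in each group
    mi = 0.0
    for (residue, group), n in joint.items():
        f_joint = n / total
        f_residue = residue_counts[residue] / total
        f_group = group_counts[group] / total
        mi += f_joint * log(f_joint / (f_residue * f_group))
    return mi
```

A column whose residues follow the grouping exactly, such as `"AACC"` with groups `[1, 1, 2, 2]`, gives the maximum value for two equal groups (the natural logarithm of 2), while a fully conserved column gives zero.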
Because the frequencies are calculated from a small sample, and because
substitutions to amino acids with similar physical properties should be
penalized only weakly, the observed amino acid frequencies are modified.
Instead of using the raw ratio of the number of occurrences of a residue
in group i to the size of group i (here i is a single group or the whole
alignment), SDPpred uses smoothed frequencies.
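To make the idea of smoothing concrete, here is a minimal sketch of one generic pseudocount scheme. It is not the SDPpred formula: the weight `kappa`, the function name `smoothed_frequencies`, and the toy three-letter substitution table are all assumptions chosen for illustration. Each observed residue contributes a fraction of its count to similar residues, according to a table of substitution probabilities whose rows sum to 1.

```python
def smoothed_frequencies(counts, substitution, kappa=0.5):
    """Smooth residue counts of one group with substitution-based pseudocounts.

    counts: dict residue -> number of occurrences in the group.
    substitution: dict source -> dict target -> probability (rows sum to 1).
    kappa: assumed weight of the pseudocounts relative to the real counts.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the group must contain at least one residue")
    smoothed = {}
    for target in substitution:
        # Pseudocount: counts of all observed residues, weighted by how
        # readily each of them is substituted by the target residue.
        pseudo = sum(n * substitution[source][target]
                     for source, n in counts.items())
        smoothed[target] = (counts.get(target, 0) + kappa * pseudo) / (total * (1 + kappa))
    return smoothed


# Hypothetical toy table: I and L are similar to each other, D is not.
toy_substitution = {
    "I": {"I": 0.70, "L": 0.25, "D": 0.05},
    "L": {"I": 0.25, "L": 0.70, "D": 0.05},
    "D": {"I": 0.05, "L": 0.05, "D": 0.90},
}
freqs = smoothed_frequencies({"I": 4}, toy_substitution)
```

In this toy case a group that contains only I still assigns a small frequency to L and an even smaller one to D, so a substitution to a similar residue is penalized less than a substitution to a dissimilar one.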
To calculate the statistical
significance of the obtained values of Ip,
each column is shuffled, yielding the distribution of mutual information values expected by chance for that column.
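The shuffling step can be sketched as follows. The example permutes the residues of one column against fixed group labels and summarizes the resulting values by their mean and standard deviation. The number of shuffles, the fixed seed, the use of unsmoothed frequencies, and the function names are assumptions made for this sketch.

```python
import random
from collections import Counter
from math import log
from statistics import mean, pstdev


def mutual_information(column, groups):
    """Mutual information between residues and group labels (plain counts)."""
    total = len(column)
    joint = Counter(zip(column, groups))
    residue_counts = Counter(column)
    group_counts = Counter(groups)
    return sum(
        (n / total) * log(n * total / (residue_counts[a] * group_counts[g]))
        for (a, g), n in joint.items()
    )


def shuffled_mi_stats(column, groups, n_shuffles=1000, seed=0):
    """Mean and standard deviation of mutual information over shuffled columns."""
    rng = random.Random(seed)      # fixed seed so the sketch is reproducible
    residues = list(column)
    values = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)      # break any link between residue and group
        values.append(mutual_information(residues, groups))
    return mean(values), pstdev(values)


column = "AAAACCCC"
groups = [1, 1, 1, 1, 2, 2, 2, 2]
observed = mutual_information(column, groups)
shuffled_mean, shuffled_sd = shuffled_mi_stats(column, groups, n_shuffles=200)
```

For this perfectly group-specific column the observed value lies well above the shuffled mean, which is the kind of excess the significance estimate is meant to capture.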
To offset the background similarity of proteins, which is higher within groups
than between groups, we calculate the expected
mutual information for column p using two coefficients, a
and b, that do not depend on the position, i.e. are the same for every
position of the alignment.
Given a series of Z-scores corresponding
to every position of the multiple alignment, one needs to evaluate the significance of
the Z-scores in order to tell whether an observed Z-score is sufficiently
high to indicate a specificity-determining position (SDP). SDPpred uses an
automated procedure for setting the
thresholds based on the computation of the Bernoulli estimator.
The observed Z-scores are sorted in decreasing order.
The threshold is defined as:
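For illustration only, a cutoff of this kind could be computed as sketched below. The sketch assumes that the estimator picks the rank k that minimizes the binomial probability of seeing at least k scores as high as the k-th ranked one by chance, and that chance tail probabilities come from a standard normal distribution. Both assumptions, as well as the function names, are choices made for this example and are not confirmed by the text above.

```python
from math import comb, erfc, sqrt


def normal_tail(z):
    """P(Z >= z) for a standard normal variable (assumed null model)."""
    return 0.5 * erfc(z / sqrt(2))


def bernoulli_cutoff(z_scores):
    """Return (k, probability) for the rank k with the smallest chance probability.

    The top k positions by Z-score are the ones that would be reported.
    Intended for modest numbers of positions: the loop is quadratic and
    uses exact binomial coefficients.
    """
    ordered = sorted(z_scores, reverse=True)
    n = len(ordered)
    best_k, best_prob = 0, 1.0
    for k, z in enumerate(ordered, start=1):
        p = normal_tail(z)
        # Probability that at least k of n independent scores reach z.
        prob = sum(comb(n, i) * p**i * (1 - p) ** (n - i)
                   for i in range(k, n + 1))
        if prob < best_prob:
            best_k, best_prob = k, prob
    return best_k, best_prob


scores = [8.0, 7.5, 7.0, 0.3, 0.1, -0.2, -0.5, -1.0, 0.0, 0.5]
k, prob = bernoulli_cutoff(scores)
```

With these made-up scores the three outliers are separated from the rest, because including the fourth-ranked score makes the chance probability jump by many orders of magnitude.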