SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins


What is SDPpred?

Input format

Output format

Example


What is SDPpred?


SDPpred is a tool for predicting residues in protein sequences that determine functional differences between proteins sharing the same general biochemical function.

Many protein families contain homologous proteins that have a common biological function but different specificity towards substrates, ligands, effectors, DNA, proteins and other interacting molecules, including other monomers of the same protein. All these interactions must be highly specific. Our aim is to find the amino acid residues that account for the different specificity of proteins from one family, i.e. to distinguish amino acid substitutions caused by the random evolutionary process from those caused by a switch of specificity.

Amino acid residues that determine differences in protein functional specificity and account for correct recognition of interaction partners are usually thought to correspond to positions of a multiple protein alignment where the distribution of amino acids is closely associated with the grouping of proteins by specificity. SDPpred searches for positions that are well conserved within specificity groups but differ between them. These positions are called SDPs (specificity-determining positions). Although such positions are obvious in alignments containing a small number of proteins and specificity groups, they are difficult to find in large protein families with a variety of specificities.
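To make the idea of a position that is "conserved within groups but different between them" concrete, here is a minimal Python sketch that scores one alignment column by the mutual information between its residues and the group labels. This measure is chosen for illustration only and should not be read as the exact statistic computed by SDPpred; the function name column_group_mi and the toy columns are hypothetical.

```python
from collections import Counter
from math import log2

def column_group_mi(column, groups):
    """Mutual information (in bits) between the residues of one alignment
    column and the specificity-group labels of the sequences.
    Illustrative measure only, not necessarily SDPpred's own statistic."""
    n = len(column)
    res_counts = Counter(column)
    grp_counts = Counter(groups)
    pair_counts = Counter(zip(column, groups))
    mi = 0.0
    for (res, grp), c in pair_counts.items():
        p_joint = c / n
        # p_joint / (p_res * p_grp), written with counts
        mi += p_joint * log2(p_joint * n * n / (res_counts[res] * grp_counts[grp]))
    return mi

# Toy data: two groups of three sequences each (made up for the example).
groups = ["A", "A", "A", "B", "B", "B"]
sdp_like = column_group_mi("KKKEEE", groups)   # conserved within, differs between
conserved = column_group_mi("GGGGGG", groups)  # conserved everywhere
noisy = column_group_mi("KEKEKE", groups)      # weakly associated with the groups
```

In this toy case the SDP-like column scores highest, while the fully conserved column scores zero because it carries no information about the grouping.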

The only information required for prediction of SDPs is a multiple alignment of protein sequences divided into specificity groups (see the Input format section for details). SDPpred can analyze alignments of up to 2000 positions containing at most 1000 proteins. There can be up to 1000 specificity groups. However, it is recommended that each group contain at least three sufficiently divergent sequences. On the other hand, the average identity within each group should not be less than 25%. Having more than two groups also strongly improves the quality of prediction, because the background evolutionary similarity is eliminated more efficiently.
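The group-size and identity recommendations can be checked before submitting an alignment. The sketch below is a minimal Python example; it assumes that identity is counted over columns where neither sequence has a gap, which is one common convention and may differ from how SDPpred itself measures identity. The function names and the toy group are hypothetical.

```python
from itertools import combinations

def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither aligned
    sequence has a gap (assumed convention)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def mean_group_identity(seqs):
    """Average pairwise identity over all pairs of sequences in one group."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)

# Toy group of three short aligned fragments (made up for the example).
group = ["MATMKDV", "MATIKDV", "MATIKDV"]
identity = mean_group_identity(group)
# Recommendation from the text: at least three sequences, identity >= 25%.
meets_recommendation = len(group) >= 3 and identity >= 0.25
```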

SDPpred predicts a set of SDPs and maps them onto the multiple alignment of the protein family or onto a user-selected protein in this alignment (see the Output format section for details).

SDPpred
  • Does not use any information about the proteins' structure. The procedure is based solely on statistical analysis of an alignment, and thus it can be applied to protein families that do not include any members with a resolved 3D structure.
  • Automatically calculates the number of SDPs and the probability of occurrence of these positions by chance (B-cutoff setting). It does not use any ad hoc cutoff and thus does not require any prior knowledge about special properties of the analyzed family.
  • Weights substitutions within specificity groups according to the physical properties of amino acids, using a substitution matrix, so that substitutions to amino acids with similar properties are only weakly penalized.
  • Incorporates information about evolutionary distance within and between groups by using different amino acid substitution matrices.



Input format


SDPpred supports the following input format.

The only information needed for prediction of SDPs is a multiple alignment of protein sequences divided into specificity groups. The aligned sequences should be in the FASTA, GDE, or Pfam plain text alignment format (in the latter case with gaps as dashes and all characters in upper case). The alignment should be manually edited in order to define the specificity groups. Groups should be separated by lines that begin with the "equals" sign and contain the name of the group that follows, e.g.

=Group1

Generally, the group name can be framed by any number of spaces and the "equals" signs, e.g. '=== Group1 ===' is also a valid header for the group named 'Group1'.

Thus the input alignment should look either like this:

=== RbsR ===
>EC_RbsR
-----MATMKDVARLAGVSTSTVSHVINKDRFVSEAITAKVEAAIKE
LNYAPSALARSLKLNQTHTIGMLITASTN-----PFYSELVRGVERS
>Pp_RbsR
-----MATIKDVAALAGISYTTVSHVLNKTRPVSEQVRLKVEAAIIE
LDYVPSAVARSLKARSTATIGLLVPNSVN-----PYFAELARGIEDA
>BS_RbsR
-----MATIKDVAGAAGVSVATVSRNLNDNGYVHEETRTRVIAAMAK
LNYYPNEVARSLYKRESRLIGLLLPDITN-----PFFPQLARGAEDE
=== GalR ===
>EC_GalR
-----MATIKDVARLAGVSVATVSRVINNSPKASEASRLAVHSAMES
LSYHPNANARALAQQTTETVGLVVGDVSD-----PFFGAMVKAVEQV
>RTFU01680
MERRRRPTLEMVAALAGVGRGTVSRVINGSDQVSPATREAVKRAIKE
LGYVPNRAARTLVTRRTDTVALVVSENNQKLFAEPFYAGIVLGVGVA

or this:

=== RbsR ===
%EC_RbsR
-----matmkdvarlagvststvshvinkdrfvseaitakveaaike
lnyapsalarslklnqthtigmlitastn-----pfyselvrgvers
%Pp_RbsR
-----matikdvaalagisyttvshvlnktrpvseqvrlkveaaiie
ldyvpsavarslkarstatigllvpnsvn-----pyfaelargieda
%BS_RbsR
-----matikdvagaagvsvatvsrnlndngyvheetrtrviaamak
lnyypnevarslykresrliglllpditn-----pffpqlargaede
=== GalR ===
%EC_GalR
-----matikdvarlagvsvatvsrvinnspkaseasrlavhsames
lsyhpnanaralaqqttetvglvvgdvsd-----pffgamvkaveqv
%ST_GalR
merrrrptlemvaalagvgrgtvsrvingsdqvspatreavkraike
lgyvpnraartlvtrrtdtvalvvsennqklfaepfyagivlgvgva

or this:

=== GLP ===
GLA_LACLC/1-284       --MDVTW--TVKYITEFVGTALLIIMGNGAVANVELKGTKA-
GLPF_HAEIN/1-249      --MDKSL--KANCIGEFLGTALLIFFGVG-CVAA-LKVAGA-
GLPF_ECOLI/1-251      MSQTSTL--KGQCIAEFLGTGLLIFFGVG-CVAA-LKVAGA-
PDUF_SALTY/1-249      --MNDSL--KAQCGAEFLGTGLFLFFGIG--CLSALKVAGA-
FPS1_YEAST/244-527    KWSSVKNTYLKEFLAEFMGTMVMIIFGSAVVCQVNVAGKIQQ
=== PIP ===
PI21_ARATH/31-266     ELKKWSF--YRAVIAEFVATLLFLYITVL--TVIGYKIQSD-
PIP1_ATRCA/32-261     ELKLWSF--WRAAIAEFIATLLFLYITVA--TVIGYKKETD-
PIP1_LYCES/45-274     ELSSWSF--YRAGIAEFMATFLFLYITIL--TVMGLKRSDS-
=== AQP ===
MIP_BOVIN/3-219       ELRSASF--WRAICAEFFASLFYVFFGLG--ASLRWAPGP--
AQP2_HUMAN/3-219      ELRSIAF--SRAVFAEFLATLLFVFFGLG--SALNWPQAL--
AQP1_HUMAN/4-227      EFKKKLF--WRAVVAEFLATTLFVFISIG--SALGFKYPVG-
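The three layouts above can be read with a short parser. The following Python sketch is an illustration of the format as described here, not the parser used by SDPpred; the function name parse_grouped_alignment is hypothetical, and the sketch assumes that every non-header line is either a name line starting with ">" or "%", a continuation of a sequence, or a Pfam-style "name sequence" line.

```python
def parse_grouped_alignment(text):
    """Return {group_name: {protein_name: aligned_sequence}}.
    Sketch only: assumes FASTA ('>'), GDE ('%') or Pfam-style lines."""
    groups = {}
    current_group = None
    current_name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("="):
            # Group header: the name may be framed by spaces and '=' signs.
            current_group = line.strip("= \t")
            groups[current_group] = {}
            current_name = None
        elif current_group is None:
            raise ValueError("sequence data before the first group header")
        elif line[0] in ">%":
            # FASTA or GDE name line.
            current_name = line[1:].strip()
            groups[current_group][current_name] = ""
        elif current_name is not None:
            # Continuation of a FASTA/GDE sequence.
            groups[current_group][current_name] += line
        else:
            # Pfam-style line: name, whitespace, aligned sequence.
            name, seq = line.split(None, 1)
            members = groups[current_group]
            members[name] = members.get(name, "") + seq.replace(" ", "")
    return groups

# Shortened fragments in the style of the examples above.
fasta_like = """=== RbsR ===
>EC_RbsR
-----MATMKDV
ARLAGVST
>Pp_RbsR
-----MATIKDVAALAGISY
=GalR
>EC_GalR
-----MATIKDVARLAGVSV
"""
pfam_like = "=== GLP ===\nGLA_LACLC/1-284       --MDVTW--TVKY\n"

parsed_fasta = parse_grouped_alignment(fasta_like)
parsed_pfam = parse_grouped_alignment(pfam_like)
```

Stripping "=" and whitespace from both ends of a header line accepts both '=Group1' and '=== Group1 ===', as the format description allows.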

The user should also select the number of shuffles used to compute the statistical significance (between 1 000 and 10 000). An alignment of a thousand sequences divided into several hundred specificity groups is analyzed in a couple of hours if each column is shuffled 10 000 times. Using fewer shuffles reduces the required time proportionally, but makes the results less reliable. Typically, the top of the SDP list remains the same, but minor variations may appear near the cutoff.
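The role of the shuffles can be illustrated with a generic permutation test: the residues of a column are shuffled many times, the score is recomputed each time, and the observed score is expressed as a Z-score relative to the shuffled distribution. The Python sketch below shows this general technique under stated assumptions; the toy score within_group_agreement and the function shuffle_z_score are hypothetical and are not the statistic or procedure implemented in SDPpred.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def within_group_agreement(column, groups):
    """Toy score (assumption): for each group, count the sequences carrying
    that group's most common residue, then sum over groups."""
    by_group = {}
    for res, grp in zip(column, groups):
        by_group.setdefault(grp, Counter())[res] += 1
    return sum(max(c.values()) for c in by_group.values())

def shuffle_z_score(column, groups, n_shuffles=1000, seed=0,
                    score=within_group_agreement):
    """Z-score of the observed score against scores of shuffled columns."""
    rng = random.Random(seed)
    observed = score(column, groups)
    residues = list(column)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # breaks the link between residue and group
        null.append(score(residues, groups))
    sd = pstdev(null)
    return 0.0 if sd == 0 else (observed - mean(null)) / sd

groups = ["A"] * 5 + ["B"] * 5
z_sdp_like = shuffle_z_score("KKKKKEEEEE", groups)
z_conserved = shuffle_z_score("GGGGGGGGGG", groups)
```

A fully conserved column gives the same score after every shuffle, so its Z-score is reported as zero here, whereas a column that follows the grouping stands out from its shuffled versions. Running time grows linearly with n_shuffles, which matches the proportional trade-off described above.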

The last parameter is the maximum allowed percentage of gaps in a column to be analyzed. Columns with a greater fraction of gaps are excluded from the analysis. Typically, this number should not exceed 30%, but if you are interested in finding, for instance, group-specific loops, it might be reasonable to set this parameter to a higher value. However, a large percentage of allowed gaps produces many SDPs at the termini of the alignment, where they are likely to be incorrect.
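The effect of the gap parameter can be sketched as a simple column filter. This Python example assumes that gaps are written as dashes and that a column is kept when its gap percentage does not exceed the limit; the function name and the toy alignment are hypothetical.

```python
def columns_passing_gap_filter(sequences, max_gap_percent=30.0):
    """Return 0-based indices of alignment columns whose percentage of gaps
    ('-') does not exceed max_gap_percent."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all aligned sequences must have the same length")
    kept = []
    for i in range(length):
        gaps = sum(s[i] == "-" for s in sequences)
        if 100.0 * gaps / len(sequences) <= max_gap_percent:
            kept.append(i)
    return kept

# Toy alignment: column 0 has 50% gaps, column 1 has 25% gaps.
aln = ["--MDKS", "-SQTST", "KWSSVK", "ELKKWS"]
kept_default = columns_passing_gap_filter(aln)       # 30% limit
kept_relaxed = columns_passing_gap_filter(aln, 50.0)  # more permissive limit
```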

The following conditions on the input alignment must be satisfied:
  • the alignment is shorter than 2000 positions;
  • the alignment contains at most 1000 proteins;
  • protein names are shorter than 40 characters;
  • group names are shorter than 15 characters;
  • there are at most 1000 specificity groups in the alignment.
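These limits are easy to check locally before submitting a job. The Python sketch below takes the alignment as a nested dictionary of the form {group name: {protein name: aligned sequence}}; this data layout and the function name check_limits are assumptions made for the example. It also reports sequences of unequal aligned length, which is not in the list above but is implied by the input being an alignment.

```python
def check_limits(groups):
    """groups: {group_name: {protein_name: aligned_sequence}}.
    Returns a list of human-readable problems (empty if none were found)."""
    problems = []
    proteins = [(name, seq)
                for members in groups.values()
                for name, seq in members.items()]
    lengths = {len(seq) for _, seq in proteins}
    if len(lengths) > 1:
        problems.append("sequences differ in aligned length")
    if any(length >= 2000 for length in lengths):
        problems.append("alignment must be shorter than 2000 positions")
    if len(proteins) > 1000:
        problems.append("alignment must contain at most 1000 proteins")
    if any(len(name) >= 40 for name, _ in proteins):
        problems.append("protein names must be shorter than 40 characters")
    if any(len(group) >= 15 for group in groups):
        problems.append("group names must be shorter than 15 characters")
    if len(groups) > 1000:
        problems.append("there must be at most 1000 specificity groups")
    return problems

good = {"RbsR": {"EC_RbsR": "MATM", "Pp_RbsR": "MATI"},
        "GalR": {"EC_GalR": "MATI"}}
bad = {"G" * 15: {"p" * 40: "MA", "q": "MAT"}}

good_problems = check_limits(good)
bad_problems = check_limits(bad)
```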


Output format


SDPpred outputs the set of SDPs, i.e. the positions of the alignment that are likely to determine differences in functional specificity between the provided groups. These positions exhibit an amino acid distribution that is highly correlated with the grouping by specificity.

The set of SDPs can be visualized in several ways:

  • As colored columns of the alignment. The alignment is preceded by information about the query, the number of predicted SDPs and the probability of obtaining this list of SDPs by chance (for details see algorithm description). The results are reliable if the probability is less than 10^-10. If the probability is greater, it is likely that there are no SDPs at all or that the specificity assignments are incorrect. In the pull-down menu below the alignment the user can select a protein whose SDP residues will be listed below. The first protein of the alignment is selected by default.
  • As a list ordered by decreasing SDP significance, more precisely by Z-score (see algorithm description for details).
  • As a plot of the cutoff significance. By default, the cutoff is set to the number of positions that corresponds to the global minimum. However, one may want to consider other local minima (particularly when there are several local minima of similar significance) and the corresponding sets of SDPs (for details on the cutoff setting see algorithm description). A new cutoff can be set by clicking on the plot. One can then look at the alignment or the list, which will display the selected number of SDPs.
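The choice between the default cutoff and alternative cutoffs can be sketched as a search for minima on the plotted curve. The Python example below assumes the curve is available as a list in which entry k-1 is the chance probability for taking the top k positions; the function name and the numbers are made up for illustration and do not come from SDPpred output.

```python
def cutoff_candidates(probabilities):
    """probabilities[k - 1] is the chance probability for the top k positions.
    Returns (default_cutoff, local_minima) as numbers of positions."""
    if not probabilities:
        raise ValueError("empty curve")
    # Default cutoff: number of positions at the global minimum.
    best = min(range(len(probabilities)), key=probabilities.__getitem__)
    local = []
    for i, p in enumerate(probabilities):
        left = probabilities[i - 1] if i > 0 else float("inf")
        right = probabilities[i + 1] if i + 1 < len(probabilities) else float("inf")
        # Strict local minimum: lower than both neighbours.
        if p < left and p < right:
            local.append(i + 1)
    return best + 1, local

# Hypothetical curve for the top 1..6 positions.
curve = [1e-3, 1e-8, 1e-12, 1e-9, 1e-11, 1e-6]
default_cutoff, alternatives = cutoff_candidates(curve)
```

Here the default cutoff is 3 positions, and 5 positions is a second local minimum that a user might select by clicking on the plot.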



An example


Here we provide an example of how SDPpred works.

Consider the MIP family of membrane channels, which includes 17 proteins, all from bacteria. These proteins are divided into two groups, the GLP group of proteins transporting mainly glycerol, and the AQP group of proteins transporting water. The input alignment looks like this:

The obtained set of SDPs consists of 10 positions. Here is the result page:

The amino acids listed below the alignment correspond to the first protein of the first group, namely __. By choosing another protein in the pull-down menu, one gets these numbers recalculated for the protein of one's choice.
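Recalculating the numbers for a chosen protein amounts to converting alignment columns into residue positions of that protein. The Python sketch below shows one way to do this, assuming gaps are dashes and positions are counted from 1 within the aligned fragment (offsets such as the "/1-249" ranges in Pfam names are not taken into account). The function name and the chosen columns are hypothetical.

```python
def residues_at_columns(aligned_seq, columns):
    """For each 0-based alignment column, return (residue, 1-based position
    in the ungapped sequence), or None where the protein has a gap."""
    result = []
    for col in columns:
        res = aligned_seq[col]
        if res == "-":
            result.append(None)
        else:
            # Count the residues up to and including this column.
            position = col + 1 - aligned_seq[:col].count("-")
            result.append((res, position))
    return result

# Fragment in the style of the GLP example; columns chosen arbitrarily.
seq = "--MDKSL--KANCIGEF"
mapped = residues_at_columns(seq, [2, 7, 9])
```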

If one chooses the "List of SDPs" option, one gets the list of SDPs, which looks like this:

The plot of probabilities for setting the cutoff for this alignment looks as follows:

The set of SDPs shown in the previous screenshots was formed by setting the cutoff at 10, which corresponds to the global minimum. However, the second minimum can be of interest as well. Clicking on it sets a new cutoff. The result pages then change; for example, the page displaying the list of SDPs would look as follows:

Version 1.0

This is the first release of the program. If you observe a bug or suspicious behavior, or get a nonsense result, please send us a note, containing the query alignment and the parameters you've used. Thank you.