dict.md logo

Improvement in protein functional site prediction by distinguishing structural and functional constraints on protein family evolution using computational design

The prediction of functional sites in newly solved protein structures is a challenge for computational structural biology. Most methods for approaching this problem use evolutionary conservation as the primary indicator of the location of functional sites. However, sequence conservation reflects not only evolutionary selection at functional sites to maintain protein function, but also selection throughout the protein to maintain the stability of the folded state. To disentangle sequence conservation due to protein functional constraints from sequence conservation due to protein structural constraints, we use all atom computational protein design methodology to predict sequence profiles expected under solely structural constraints, and to compute the free energy difference between the naturally occurring amino acid and the lowest free energy amino acid at each position. We show that functional sites are more likely than non-functional sites to have computed sequence profiles which differ significantly from the naturally occurring sequence profiles and to have residues with sub-optimal free energies, and that incorporation of these two measures improves sequence based prediction of protein functional sites. The combined sequence and structure based functional site prediction method has been implemented in a publicly available web server.

The prediction of functional sites in newly solved protein structures is an important challenge for structural genomics in which protein structures are determined without knowledge of the function. Most methods for functional site identification utilize measures of amino acid sequence conservation in homologous sequences (14), based on the assumption that functional sites are relatively conserved during evolution. Protein structural information has also been used to help identify protein functional sites (59).

A problem with sequence based methods for predicting functional sites is that residues may be conserved due to structural constraints which can confound the accurate prediction of functional sites (10). Fortunately, there has been considerable progress in recent years in the development of methods for identifying structural constraints on protein sequences. Computational methods have been developed for estimating the free energy changes upon amino acid substitutions using simple physically based potential functions (11,12), which have allowed the pinpointing of critical residues at protein–protein interfaces (12). These energy functions have also been tested in computational protein design (13,14) of proteins with novel structures (15) and functions (16). In the context of this paper, the important aspect of these methods is that they make possible the analysis of structural constraints on protein evolution independent of consideration of protein function.

In this paper, we describe a protein functional site prediction method which distinguishes functional and structural constraints on protein evolution using all atom computational protein design methodology (12,13). Proteins evolved under selective pressure to both maintain the stability of the overall structure and biochemical function. In principle, if we can separate the structure-based selective pressure from function-based selective pressure, we should be able to distinguish the functionally important residues from the structurally important residues. Earlier computational work along these lines includes the use of a low resolution protein design method in which protein side chain interactions are described using the Miyazawa–Jernigan contact potential to identify those aspects of evolutionary conservation which are likely to reflect structural constraints (17), the use of the change in electrostatic energy (18) upon in silico mutation to identify functional sites, and the use of measures of structural fitness and conservation to identify sites potentially conserved due to structural constraints (19). Complementary experimental studies have shown that mutations of functional studies can actually increase protein stability (20,21). Here, we build on these ideas, and use the Rosetta all atom computational protein design (13) and free energy calculation (12) methodologies to obtain measurements of the structural constraints in a protein structure, and show that combining these measures with sequence conservation improves prediction of protein functional sites. The algorithm has been implemented in a publicly available web server at http://tools.bakerlab.org/~gcheng/.

We used an enzyme active site set compiled in Janet Thornton's group (22) described previously. Duplicated entries were deleted and proteins with few known homologous sequences (<20) were removed from the test set. For testing the discriminatory power of our method, this set was further divided into a training set (78 proteins) and a test set (314 proteins).

Homologous sequences were gathered using five rounds of PSI-BLAST against a 90% non-redundant protein database (23), with an E-value cut-off of 1E-10. MUSCLE (24) was used to align the sequences. Sequences with <80% of the full length of the original sequences were removed from the alignment.

Sequence conservation scores were evaluated with the SCORECONS (4) method using the multiple sequence alignment from MUSCLE as input.

The Rosetta design program was used to generate sequences predicted to be stable for each of the test proteins. Forty protein sequences were generated for each structure using Rosetta Design (9) and the PSI-blast software package was used to generate a position specific scoring matrix (PSSM) for the designed sequences as well as naturally occurring sequences. The Euclidian distance between the two PSSM's was computed for each residue, and rescaled within the range of 0–1 to provide comparable results across different proteins, with 0 corresponding to high similarity and 1 to low similarity.

The Rosetta ΔΔG module was used to calculate the free energy changes accompanying substitutions for each of the 20 amino acids at each position in the protein. The weight on the solvation term was increased 2-fold to capture a large fraction of the buried polar interactions which are frequently involved in function. Accurately estimating the tradeoff between the loss of solvation free energy and the formation of attractive polar interactions accompanying the burial of polar groups remains a challenge for computational modeling. The energy gap between the native amino acid and the lowest energy amino acid at each position was then determined.

Logistic regression with the generalized linear model module of R was used to determine weights on the sequence conservation, natural/designed sequence profile difference and natural/optimal residue free energy gap which optimize the separation between functional and non-functional residues.

The WWW server was built using a combination of javascript, php, perl and python. The final results are provided in three different forms. First, 32 pictures from different angles are generated using pymol (25). Second, the temperature columns in the PDB file are replaced with the combined sequence-structure score. The modified PDB files are downloadable by clicking on any of the images. Third, the values of the three independent measures and the combined score for each residue are listed in a separate Table.

We use the Rosetta computational protein design and ΔΔG calculation methods together with evolutionary sequence information to obtain two related but distinct measures of the extent of structural versus functional constraints at each residue position in a protein structure. Rosetta design (13) generates sequences that are low in free energy for a specified protein structure independent of any functional constraints. We compare the sequences of the naturally occurring homologues of the protein under study with the designed sequences. The differences between these two sets of sequences reflect the functional pressures on the protein family evolution as these contribute to the natural sequence profiles but not the computed profiles. Our first measure of functional constraints is thus the deviation between the naturally occurring and computed sequence profiles. For the second measure, we use the Rosetta ΔΔG calculation method (12) to estimate the free energy gap between the naturally occurring amino acid and the energetically most favorable amino acid at the same position. This energy gap should reflect the extent of functional pressure exerted on the site: if there is strong selection for protein function, e.g. a critical role in enzyme catalysis, the naturally occurring residue may be far from optimal for protein stability (20).

Our goal is to combine these two measures with standard sequence conservation methods to improve protein functional site identification. For sequence conservation calculation, we use the recently described method, SCORECONS (4,26), which takes into account the alignment gaps, amino acid stereochemical properties and sequence weighing. A large data set of enzyme active sites compiled by Porter et al. (22) is used for testing our method.

In the following sections, first we evaluate the extent to which the sequence conservation measure and the two energy based measures described above can independently distinguish between functional and non-functional residues. We then describe the improvement of the accuracy of functional residue prediction using a combination of all three measures.

Figure 1 compares the distribution of sequence conservation scores for the functional residue sites to that for all the residues in the enzyme active site set (22). The histogram of sequence conservation scores for all residues peaks at a score value of 0.5. The functional sites are generally more conserved than non-functional sites, as indicated by the shift of the score distribution peak towards a higher value. The overlap between these two distributions is significant, possibly due to confounding factors such as the conservation of structurally important residues and sequence alignment errors. As this significant overlap of the two distributions makes it difficult to confidently identify functionally important sites based on sequence conservation alone, we need additional measures to further separate the functional sites from non-functional sites.

Next we compare the distribution of the differences between the naturally sequence profiles and the designed sequence profiles for functional residues to that for all residues. As noted above, our assumption is that the sites under strong functional selection pressure may have distorted amino acid distributions from those of the designed sequences, which were only subjected to selection for stability. Figure 2 shows that, indeed, for functional residue sites the geometric distances between designed sequence profiles and naturally occurring sequence profiles (for details see Materials and Methods) are generally larger than those for all the residue sites. This measure provides a new dimension to separate functional sites from non-functional sites.

Next we compare the distribution of the energy differences between the naturally occurring amino acid and the energetically most favorable amino acid at a given sequence position. (see Materials and Methods). In the distribution for all residues (Figure 3A), nearly half of the sites have energy differences below 1.0 kcal/mol. In contrast, in the distribution for functional residue sites (Figure 3B), the majority of the sites have energy differences of >1.0 kcal/mol. This observation indicates that a large fraction of functional sites appear to be suboptimal for stability. The energy gap thus provides another dimension for separating functional constraints from structural constraints.

Now we are ready to combine the above three sources of information—the sequence conservation scores, the differences between the naturally-occurring and designed sequence profiles and the energy gap between native and energetically most favorable residue, to better separate functional sites from non-functional sites. Logistic regression is used to achieve the optimal combination of these measures. To investigate whether or not these different measures are independent, we calculate the residual deviance in the accuracy of functional site prediction for each combination of the three information sources (the lower the residual deviance and, the more accurate the prediction). As summarized in Supplementary Table 1, all three measures make significant and independent contributions to the reduction in residual deviance. The reduction is greater than expected given the increase in the number of parameters, indicating that none of the information sources is redundant. We achieve the best model for identifying functional sites by a linear combination of all three measures, with the weights summarized in Supplementary Table 2.

To test the discriminating power of the combined model, we randomly chose 20% of the proteins in the enzyme active site set for training the model parameters and use the remaining 80% of the proteins as the test set for functional site prediction. The results are summarized in the ROC plot in Figure 4. It is evident that the combination of the energy gap information with the sequence conservation information improves the discrimination of functional and non-functional sites. The further improvement upon addition of the profile difference measure is relatively small. In the tests below and on the web server which implements the method, the linear combination of the three measures is used.

Figures 5 and 6 show structures of Arginine kinase and chymosin B colored according to sequence conservation scores (Figures 5A and 6A) and our combined measures (Figures 5B and 6B). In Figure 6A, the catalytic residues (plotted in space-fill view) are not among the most conserved residues. In Figure 5A, the functional residues are among the most conserved residues, but there are several residues in the hydrophobic core of the beta barrel with similar conservation scores, preventing the confident prediction of functional sites. As shown in Figures 5B and 6B, our combined measure increases prediction accuracy by reducing the number of false positives. In Figure 6B, the largest high scoring cluster correctly overlaps with the functional residues. Similarly, the conserved hydrophobic core residues in Figure 5B have lower scores than in Figure 5A, reducing the number of false positives.

To test the generality of our method, we applied our functional prediction algorithm to an independent functional site set compiled by the Lovell group (19) which includes ligand binding sites in addition to enzyme active sites. As indicated in the ROC plots in Figure 7, the results closely parallel the results on the Thornton group test set with the combined sequence and structure method performing significantly better than SCORECONS (the two sets have a modest degree of overlap: 49 PDB codes are shared between the 490 proteins in the Thornton set and the 243 proteins in the Lovell set). Chelliah et al. (19) reported instead of a ROC plot the number of true positives and false positives for a particular threshold with their structure based method. As indicated by the star in Figure 7, the performance of their method on this set is better than SCORECONS but not as good as our combined sequence and structure based method.

We also compared our method with the protein design based method by Pei et al. (17) on their test set. The results are summarized in Table 1, and a modest improvement in prediction accuracy is also observed.

Our work shows that the combination of sequence conservation information with structural information from all-atom protein design and free energy calculation improves protein functional site identification. Previous work has explored ways to improve the accuracy of functional site prediction by combining sequence and structural information (27,28). For example, Pei et al. (17) used a low resolution protein sequence design algorithm to predict protein ligand binding sites based on the differences between designed sequences and naturally occurring sequences. Wang et al. (10) proposed a method to separate the contribution of individual residue to structure and to function, and used it for protein function classification. Chelliah et al. (19) developed a method to distinguish structural constraints from general evolutionary constraints using both homologous sequences and homologous structures. Our approach goes beyond these earlier studies by using a physically realistic atomic level description of the protein, which has been validated by successful protein design efforts. Furthermore, our method combines multiple sources of information and achieves better results than any single source of information. Pei et al. (17) used the sequence profile differences between naturally occurring proteins and designed proteins, but did not use the sequence conservation score explicitly. Chelliah et al. (19) based their prediction on geometry comparisons, and did not use structure based energetic information.

There are a number of sources of errors in our predictions. Our approach, like the SCORECONS method on which it is based, is sensitive to the quality of the input multiple sequence alignments. Many functional sites are located in loop regions of protein structures, which can be problematic for sequence alignment algorithms. The structure based measures can in some cases be misleading, since functionally important residues can also contribute to stability. The method could be improved using better multiple sequence alignment algorithms, by spatial clustering of the high scoring residues (28) and by introducing backbone flexibility in the energy calculations. For proteins which have no detectable sequence homologs (singletons) (5,18), our method will work poorly because the structure based methods do not provide strong discriminations on their own (Figures 2 and 3) and the method is also clearly not applicable when the three dimensional structure of the protein is not known. However, with the rapid growth in the number of genomes sequenced and in the number of high resolution protein structures, automated functional site identification methods should play an increasingly important role in guiding experimental studies.

A web server which implements the method described in this paper is available at http://tools.bakerlab.org/~gcheng/.

Supplementary Data are available at NAR Online.