
Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters

With sequencing projects producing large amounts of data lacking functional characterization, there is an increasing need for automated sequence annotation procedures. We created the CluSTr (Clusters of SWISS-PROT+TrEMBL proteins) database (1), a resource for the automatic classification of SWISS-PROT+TrEMBL (2) proteins into groups of related sequences. The clustering is based on analysis of all pair-wise sequence comparisons between proteins using the Smith–Waterman algorithm (3). A Monte-Carlo simulation, resulting in a Z-score (4), is used to estimate the statistical significance of raw Smith–Waterman scores between potentially related proteins. Clustering is carried out at different levels of protein similarity, yielding a hierarchical organization of the protein groups.
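The Z-score idea can be illustrated with a minimal sketch. It is not the CluSTr implementation: the scoring scheme (fixed match/mismatch values and a linear gap penalty), the number of shuffles and the function names `smith_waterman_score` and `z_score` are all assumptions made for illustration, whereas production comparisons would normally use a substitution matrix and affine gap penalties. The sketch shuffles one sequence repeatedly, rescores it against the other, and expresses the raw score in standard deviations above the mean of the shuffled scores.

```python
import random
import statistics


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score under a toy scoring scheme (assumed values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_score(a, b, n_shuffles=100, seed=0):
    """Monte-Carlo Z-score: (raw - mean of shuffled scores) / their std. dev."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    raw = smith_waterman_score(a, b)
    residues = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # preserves composition, destroys order
        shuffled_scores.append(smith_waterman_score(a, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        # Degenerate case: all shuffles scored the same.
        return float("inf") if raw > mean else 0.0
    return (raw - mean) / sd
```

Because shuffling preserves amino acid composition, a high Z-score indicates similarity that depends on residue order rather than on composition alone.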

The database currently holds data for more than 70 completely sequenced proteomes, including the eukaryotic proteomes of Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Mus musculus and Saccharomyces cerevisiae and more than 65 prokaryotic ones. The complete list of available proteomes is provided on the documentation page (http://www.ebi.ac.uk/clustr/documentation.html).

Web access to the data stored in a relational database (ORACLE) is provided using Java servlet technology. The CluSTr database is available for querying and browsing from http://www.ebi.ac.uk/clustr.

Two types of search forms are provided: a ‘simple search’ that directly queries the CluSTr data in Oracle and an ‘advanced search’ that allows free-text querying via the EBI SRS (Sequence Retrieval System) server (5). The result page contains the description of the requested cluster, a list of the grouped proteins with their SWISS-PROT+TrEMBL descriptions and InterPro-based (6) information on the domain/family signatures represented in the cluster. A number of links are provided for further data analysis. These links allow users to download entries in the clusters of interest, to look at the graphical representation of known functional signatures, to inspect or edit the corresponding multiple alignments, to see the resolved structural domains, and to create dynamic SRS links to other biological databases.

SRS makes it possible to retrieve information for clustered proteins from other databases using indexed links. The SWISS-PROT+TrEMBL accession numbers stored in the CluSTr database are used to access the corresponding proteins, which can be downloaded from SRS in various formats. The retrieved SWISS-PROT+TrEMBL records can be linked further inside SRS to other databases. For example, on the basis of information from OMIM (7) (a catalogue of human genes and genetic disorders), it is possible to see whether proteins from a cluster are associated with a disease.

The CluSTr interface is enriched with information on the underlying domain architecture via the InterPro resource. For each cluster, an InterPro section provides a summary of the domain content, showing the homogeneity of a group in terms of the domains represented. A visual representation of these data is provided through the InterPro graphical interface. The domain composition of a cluster is even more apparent in the condensed graphical view, which shows a single representative for clustered proteins with exactly the same domain architecture.
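The idea behind the condensed view can be sketched as a grouping step. This is an illustrative sketch only, not the InterPro or CluSTr code: the function name `condense_by_architecture`, the placeholder domain labels, the assumption that an architecture is the ordered list of domains, and the choice of the alphabetically first accession as representative are all hypothetical.

```python
from collections import defaultdict


def condense_by_architecture(architectures):
    """Group proteins by identical domain architecture.

    architectures: dict mapping accession -> ordered list of domain labels.
    Returns a dict mapping architecture (tuple) -> (representative, group size).
    """
    groups = defaultdict(list)
    for accession, domains in architectures.items():
        # Assumption: "same architecture" means same domains in the same order.
        groups[tuple(domains)].append(accession)
    return {
        arch: (sorted(members)[0], len(members))  # hypothetical representative rule
        for arch, members in groups.items()
    }


# Placeholder accessions and domain labels, for illustration only.
example = {
    "P1": ["DOM_A", "DOM_B"],
    "P2": ["DOM_A", "DOM_B"],
    "P3": ["DOM_B"],
}
condensed = condense_by_architecture(example)
```

A cluster that collapses to a single architecture is homogeneous in domain content, whereas several architectures point to a more heterogeneous group.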

The DisplayFam multiple alignment browser has recently been included to visualize summarized multiple alignments based on sequence divergence trees and consensus sequences. For users who want to edit multiple alignments of the clustered proteins, we have integrated Jalview (8). This tool has many useful colouring schemes, which highlight different features of multiple alignments. For example, it is possible to colour amino acids according to their biochemical properties, to identify the most conserved residue columns and to see secondary structure predictions.

Structural information provides important insight into protein function. For each cluster, the list of secondary structure cross-references to the Homology derived Secondary Structure of Proteins (HSSP) database (9) is generated dynamically. The database also provides links to the Protein Data Bank (PDB) resource (10), the archive of structural data of biological macromolecules.

An automated procedure has been developed to update CluSTr data incrementally, in synchrony with the weekly releases of SWISS-PROT+TrEMBL. Additional ORACLE tables are used to facilitate the update procedure, which identifies new, updated, unchanged and deleted proteins using SWISS-PROT+TrEMBL accession numbers and the cyclic redundancy checksum (CRC64) of sequences. The list of new and changed proteins is used to calculate the similarity between proteins in this set and against proteins in the unchanged set. Clusters are built from the updated similarity matrix at different Z-score levels using a single linkage algorithm.
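The two steps of the update described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the production procedure: `zlib.crc32` stands in for CRC64 (which the Python standard library does not provide), releases are modelled as plain dictionaries rather than ORACLE tables, and the names `classify_release` and `single_linkage` as well as the `>=` threshold convention are hypothetical.

```python
import zlib


def checksum(sequence):
    # Stand-in only: CRC32 from the standard library, NOT the CRC64 used by CluSTr.
    return zlib.crc32(sequence.encode("ascii"))


def classify_release(old, new):
    """Compare two releases given as dicts mapping accession -> sequence."""
    old_acc, new_acc = set(old), set(new)
    shared = old_acc & new_acc
    updated = {acc for acc in shared if checksum(old[acc]) != checksum(new[acc])}
    return {
        "new": new_acc - old_acc,
        "deleted": old_acc - new_acc,
        "updated": updated,
        "unchanged": shared - updated,
    }


def single_linkage(proteins, z_scores, threshold):
    """Single linkage clusters: connected components of pairs with Z >= threshold.

    z_scores: dict mapping (accession_a, accession_b) -> Z-score.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), z in z_scores.items():
        if z >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(members) for members in clusters.values())
```

Running `single_linkage` on the same similarity data with decreasing thresholds yields progressively larger, nested clusters, which is one way to obtain the hierarchical organization of protein groups.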

An automatic procedure for tracing cluster identifiers between releases was developed based on the MatchDom algorithm (11). The algorithm transfers the cluster identifiers from a reference cluster set to a target set, looking for cluster overlaps between two successive releases of CluSTr. The matching clusters are sorted and the best overlapping cluster inherits the identifier of a reference cluster. The other clusters get new identifiers. The procedure makes it possible to maintain stable cluster identifiers corresponding to particular protein families.
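The transfer of identifiers by overlap can be illustrated with a greedy sketch. It is not the MatchDom algorithm itself, whose details are given in reference (11); the overlap measure (number of shared accessions), the tie-breaking order, the integer identifiers and the function name `transfer_identifiers` are assumptions made for illustration.

```python
from itertools import count


def transfer_identifiers(reference, target, first_new_id):
    """Carry cluster identifiers from a reference release to a target release.

    reference: dict mapping cluster id (int) -> set of accessions.
    target: list of sets of accessions (clusters without identifiers yet).
    first_new_id: first identifier to hand out to unmatched clusters.
    """
    overlaps = []
    for ref_id, ref_members in reference.items():
        for idx, members in enumerate(target):
            shared = len(ref_members & members)
            if shared:
                overlaps.append((shared, ref_id, idx))
    # Largest overlap first; ties broken by identifier, then target position.
    overlaps.sort(key=lambda t: (-t[0], t[1], t[2]))

    assigned = {}  # target index -> inherited identifier
    used = set()   # reference identifiers already inherited
    for _shared, ref_id, idx in overlaps:
        if ref_id not in used and idx not in assigned:
            assigned[idx] = ref_id
            used.add(ref_id)

    fresh = count(first_new_id)
    result = {}
    for idx, members in enumerate(target):
        cluster_id = assigned[idx] if idx in assigned else next(fresh)
        result[cluster_id] = set(members)
    return result
```

In this sketch, when a reference cluster splits between releases, the fragment sharing the most members keeps the old identifier and the remaining fragments receive new ones.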

Apart from its use as an interactive web resource, CluSTr has been applied to the analysis of complete proteomes (http://www.ebi.ac.uk/proteome) (12). The CluSTr section of the Proteome Analysis database provides a general description of homologous protein groups in completely sequenced genomes as well as a list of candidates with novel sequence domains. The methodology has also been used for the in-depth study of protein families (13).

CluSTr is an evolving resource. New developments include an improved web interface, an automated update procedure and better coverage of SWISS-PROT+TrEMBL data, which will allow us to use the resource in the TrEMBL automatic annotation routine, known as a RuleBase (14).