
SMART 5: domains in the context of genomes and networks

When the Simple Modular Architecture Research Tool (SMART) database was first made public 8 years ago (1), the current extent of completely sequenced genomes was little more than a dream. In the last few years, the astonishing successes of whole-organism approaches to biology have not been limited to sequencing efforts; they also include techniques such as the high-throughput identification of protein–protein interactions, which have created new opportunities and higher expectations for computational approaches to interpreting biological sequences. In the last 2 years, we have been developing new ways of meeting these challenges.

The basic data of SMART are high-quality manually derived alignments of protein domain families. Represented as hidden Markov models (2), these alignments allow us to identify protein domains in sequence databases; the results are stored in a database accessible via a simple web interface (http://smart.embl.de). The data provide a framework for understanding the evolution and function of genes and proteins throughout the living world. Whereas the SMART philosophy has been to include essentially all available protein sequences, we recognize that many users are interested primarily in the biology of a particular organism. Accordingly, we have developed new views more tightly integrated with genome data. These new genome views allow further cross-referencing with protein–protein interaction maps, making SMART an invaluable tool for systems biologists to interpret pathways and networks.

Owing to the nature of our source databases (Swiss-Prot, SP-TrEMBL and Ensembl) (3,4) the protein database in SMART has significant redundancy, even though identical proteins are removed. Different proteins and fragments in the source databases often correspond to the same gene. Users exploring the various domain architectures or interested in domain counts in various genomes are particularly vulnerable to this problem, as the numbers they get are often inflated and unrealistic. To overcome this problem, we extended SMART with a new operating mode, namely ‘Genomic’ mode. The main difference between normal and genomic mode in SMART is the underlying protein database. In genomic mode, only the proteins from 170 completely sequenced genomes are included (a full list is available at http://smart.embl.de/smart/list_genomes.pl). Swiss-Prot (3) is our main source database of genomic data, together with Ensembl (4) for metazoan genomes. This database has minimal redundancy, and is therefore particularly useful for whole genome studies of domain architectures or single domain distributions.
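The inflation of domain counts described above can be illustrated with a minimal sketch. It is not how SMART builds its genomic database (genomic mode simply uses a different underlying protein set); the entries, gene identifiers and the "keep the longest protein per gene" rule are all assumptions made for illustration.

```python
from collections import Counter

# Toy entries: (protein_id, gene_id, length, domains). All values are invented.
ENTRIES = [
    ("frag1", "geneA", 120, ("SH2",)),        # fragment of geneA
    ("full1", "geneA", 500, ("SH2", "SH3")),  # full-length product of geneA
    ("prot2", "geneB", 300, ("SH3",)),
]


def domain_counts(entries, one_per_gene=False):
    """Count proteins containing each domain, optionally one protein per gene."""
    if one_per_gene:
        # Assumed rule: keep only the longest protein for each gene.
        longest = {}
        for entry in entries:
            gene_id, length = entry[1], entry[2]
            if gene_id not in longest or length > longest[gene_id][2]:
                longest[gene_id] = entry
        entries = list(longest.values())
    counts = Counter()
    for _pid, _gene, _length, domains in entries:
        counts.update(set(domains))  # count each domain once per protein
    return counts


redundant = domain_counts(ENTRIES)
collapsed = domain_counts(ENTRIES, one_per_gene=True)
```

In this toy set the fragment and the full-length protein of the same gene are both counted in the redundant case, so the SH2 count is twice what it is once the set is collapsed to one protein per gene.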

To improve function prediction for single domains, we annotated essential catalytic sites for all enzymatic domains in SMART. These were extracted from structural reports in the primary literature, wherever the catalytic mechanism was known (5). Protein sequences can now be scanned for the presence of important catalytic amino acids (Figure 1). Absence of one of these amino acids very likely results in loss of catalytic activity. It has recently become apparent that many domains homologous to signaling enzymes seem to have lost their catalytic ability, although they are evolutionarily conserved. Instead of a catalytic function, these domains appear to play a role in regulatory processes. This trend is especially obvious in the protein tyrosine phosphatase family (5). The inclusion of catalytic amino acid residues in the database will allow a more rapid identification of inactive enzyme homologs in the future.
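The kind of check described above can be sketched as follows. This is a minimal illustration, not SMART's implementation: the representation of required residues as a mapping from alignment position to allowed amino acids, and the example sequences and positions, are assumptions and not taken from the SMART annotation.

```python
def missing_catalytic_residues(aligned_seq, required):
    """Return (position, observed_residue) for every required site not satisfied.

    aligned_seq: domain sequence aligned to the domain model ('-' marks a gap).
    required:    hypothetical mapping {0-based alignment position: allowed residues}.
    """
    missing = []
    for pos, allowed in sorted(required.items()):
        # Positions beyond the end of the sequence are treated as gaps.
        residue = aligned_seq[pos] if pos < len(aligned_seq) else "-"
        if residue not in allowed:
            missing.append((pos, residue))
    return missing


def is_putatively_active(aligned_seq, required):
    """A domain is flagged inactive if any essential residue is missing."""
    return not missing_catalytic_residues(aligned_seq, required)


# Invented example: two essential sites, a Cys at position 2 and an Arg at position 8.
REQUIRED = {2: "C", 8: "R"}
intact = "HVCSAGIGRSG"
mutated = "HVSSAGIGRSG"  # the Cys at position 2 is replaced by Ser
```

With these toy inputs, the intact sequence satisfies both sites, whereas the mutated one reports position 2 as missing and would therefore be labelled inactive.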

As a further step from the single domain towards an understanding of multidomain proteins, SMART now predicts the taxonomic class in which the concept of a protein, that is, its domain architecture, first arose. The domain architecture is defined as the linear order of all SMART domains in the protein sequence. To derive its point of origin, all proteins with the same domain architecture are mapped onto NCBI's taxonomy (6). The last common ancestor of all organisms containing at least one protein with the domain architecture is defined as the point of its origin. From knowledge of the origin of domain architectures, one might infer the distribution and presence of these architectures in genomes that are not yet, or only incompletely, sequenced. In addition, conclusions on the general function of domain architectures can be drawn.
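The last-common-ancestor definition can be made concrete with a short sketch. The lineages below are heavily simplified stand-ins for the NCBI taxonomy, and the protein identifiers and architectures are invented; only the procedure (collect all species sharing an architecture, then take the deepest taxon common to all of their lineages) follows the text.

```python
# Simplified root-to-species lineages (an assumption, not the real NCBI taxonomy).
LINEAGES = {
    "Homo sapiens": ["cellular organisms", "Eukaryota", "Metazoa", "Homo sapiens"],
    "Drosophila melanogaster": ["cellular organisms", "Eukaryota", "Metazoa",
                                "Drosophila melanogaster"],
    "Saccharomyces cerevisiae": ["cellular organisms", "Eukaryota", "Fungi",
                                 "Saccharomyces cerevisiae"],
}

# Invented proteins: (protein_id, species, domain architecture in linear order).
PROTEINS = [
    ("p1", "Homo sapiens", ("SH3", "SH2", "TyrKc")),
    ("p2", "Drosophila melanogaster", ("SH3", "SH2", "TyrKc")),
    ("p3", "Saccharomyces cerevisiae", ("SH3",)),
    ("p4", "Homo sapiens", ("SH3",)),
]


def shared_lineage(lineages):
    """Return the longest root-to-node prefix common to all lineages."""
    shared = []
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


def architecture_origin(architecture, proteins, lineages):
    """Deepest taxon shared by all species that have the architecture, or None."""
    species = {sp for _pid, sp, arch in proteins if arch == tuple(architecture)}
    shared = shared_lineage([lineages[sp] for sp in species])
    return shared[-1] if shared else None


origin_kinase = architecture_origin(("SH3", "SH2", "TyrKc"), PROTEINS, LINEAGES)
origin_sh3 = architecture_origin(("SH3",), PROTEINS, LINEAGES)
```

In this toy data the three-domain architecture occurs only in human and fly and is therefore assigned to Metazoa, while the single-domain architecture also occurs in yeast and is assigned to Eukaryota. An architecture absent from the data yields no origin.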

The latest version of SMART provides information about putative interaction partners for more than 350 000 proteins (Figure 2). This information is imported from the STRING database (7), in which known and predicted protein–protein associations are integrated from a variety of sources. The interactors are shown in SMART in the form of a summary graphic (network); the various types of interaction evidence are depicted as lines of different colors in the network. Clicking on the graphic will launch the STRING website, where the underlying evidence can be studied in detail. The interactions in STRING include physical binding interactions, as well as functional associations, such as membership in a common pathway or process. The data are derived from a variety of sources, including knowledge bases, such as BIND (8), KEGG (9), HPRD (10) and Reactome (11), as well as in silico prediction approaches and automated text-mining. STRING aims to improve usability of the interactome by scoring and ranking interaction data (making a confidence estimate on each prediction), as well as by transferring interaction knowledge between model organisms where applicable. SMART and STRING are both cross-referenced through a common set of proteins and genomes, and STRING in turn uses domain information from the SMART server in its pages as well.

The core of SMART is a relational database management system (RDBMS) which stores information on SMART domains (1,12). Owing to the exponentially increasing amount of data, many parts of the database access code have been updated or completely rewritten, resulting in greatly improved response times, most noticeably in the domain architecture analysis operations.

The SMART database includes information on domain presence in all proteins in a non-redundant database, now with added data on the catalytic activity of 50 catalytic domains. All domain architecture analysis results include this information, and domains with missing essential amino acids are overlaid with the word ‘inactive’ (Figure 1). The domain annotation page provides detailed information on which of the required amino acids are missing, and gives pointers to the relevant literature.

DisEMBL [http://dis.embl.de, (13)] predictions of intrinsic protein disorder have been included in SMART's analysis methods. DisEMBL is a computational tool for the prediction of disordered/unstructured regions within a protein sequence. Predictions included in SMART are based on missing coordinates in X-ray structures, as defined by REMARK465 entries in PDB, and on the ‘Hot loops’ method. Hot loops constitute a refined subset of the standard loops/coils as defined by DSSP (14), namely, those loops with a high degree of mobility as determined from C-α temperature factors (B-factors).

SMART's user interface was completely rewritten and is now fully compliant with the latest web standards, such as XHTML 1.0 and CSS2. Users with standards-compliant web browsers can fully enjoy the extra speed and features. Owing to increasing server load, the queuing system was completely rewritten and the hardware greatly expanded, resulting in more stable operation and faster response times.

An important new feature is the introduction of taxonomic trees into SMART. Two primary uses for taxonomic trees in SMART are the grouping of domain architecture query results and the detailed taxonomic distribution of domains now shown on domain annotation pages (Figure 3). The grouping of architecture query results allows users to easily display only proteins from certain species or taxonomic nodes. Taxonomic distribution of proteins on domain annotation pages gives a detailed overview of domain presence in different species and taxa.
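Both uses of the taxonomic tree can be sketched in a few lines. The example below is an illustration under stated assumptions, not the SMART code: the lineages are simplified, the protein identifiers are invented, and the function names are hypothetical.

```python
from collections import Counter

# Simplified root-to-species lineages (an assumption, not the real NCBI taxonomy).
LINEAGES = {
    "Homo sapiens": ["Eukaryota", "Metazoa", "Homo sapiens"],
    "Drosophila melanogaster": ["Eukaryota", "Metazoa", "Drosophila melanogaster"],
    "Saccharomyces cerevisiae": ["Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
}

# Invented query results: (protein_id, species) for proteins containing a domain.
HITS = [
    ("p1", "Homo sapiens"),
    ("p2", "Homo sapiens"),
    ("p3", "Drosophila melanogaster"),
    ("p4", "Saccharomyces cerevisiae"),
]


def taxonomic_distribution(hits, lineages):
    """Count hits at every node of the tree, from the root down to the species."""
    counts = Counter()
    for _pid, species in hits:
        counts.update(lineages[species])
    return counts


def proteins_under(node, hits, lineages):
    """Keep only the proteins whose species lies at or below the given node."""
    return [pid for pid, species in hits if node in lineages[species]]


distribution = taxonomic_distribution(HITS, LINEAGES)
metazoan_hits = proteins_under("Metazoa", HITS, LINEAGES)
```

Counting each hit at every node along its lineage gives the per-taxon totals shown on a distribution view, and the same lineages serve to restrict a result list to a chosen species or taxonomic node.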