GRSDB: a database of quadruplex forming G-rich sequences in alternatively processed mammalian pre-mRNA sequences

Guanine-rich nucleic acids are known to form highly stable G-quadruplex structures, also known as G-quartets. Recently, there has been a tremendous amount of interest in studying G-quadruplexes owing to the realization of their biological importance. G-rich sequences (GRSs) capable of forming G-quadruplexes are found in the vicinity of polyadenylation regions and are involved in regulating 3′ end processing of mammalian pre-mRNAs. G-rich motifs are also known to play an important role in alternative, tissue-specific splicing by interacting with the hnRNP H protein subfamily. Whether quadruplex structure directly plays a role in regulating RNA processing events requires further investigation. To date there has not been a comprehensive effort to study G-quadruplexes near RNA processing sites. We have applied a computational approach to map putative quadruplex-forming GRSs within the transcribed regions of a large number of alternatively processed human and mouse gene sequences that were obtained as fully annotated entries from GenBank and RefSeq. We have used the computed data to build the GRSDB database, which provides a unique avenue for studying G-quadruplexes in the context of RNA processing sites. The GRSDB website offers visual comparison of G-quadruplex distribution patterns among all the alternative RNA products of a gene with the help of dynamic graphics. At present, GRSDB contains data from 1310 human and mouse genes, of which 1188 are alternatively processed. It has a total of 379 223 predicted G-quadruplexes, of which 54 252 are near RNA processing sites. GRSDB is a good resource for researchers interested in investigating the functional relevance of G-quadruplexes, especially in the context of alternative RNA processing. It can be accessed at http://bioinformatics.ramapo.edu/grsdb/.

Guanine-rich nucleic acids are known to form higher order structures. Their ability to form highly stable quadruplex structures was discovered more than four decades ago (1). The G-quadruplex structure, also known as a G-quartet, is composed of stacked G-tetrads, which are square co-planar arrays of four guanine bases each. Cyclic Hoogsteen hydrogen bonding between the four guanines within each tetrad renders a high level of stability to the quadruplex (Figure 1). Although structures with three or more G-tetrads are considered to be more stable, many nucleotide sequences are known to form quadruplexes with two G-tetrads (2,3). G-quadruplexes may be formed by repeated folding of a single nucleic acid molecule (unimolecular G-quadruplex) or by interaction of two or four strands. The former is more likely to be encountered in physiological conditions (4,5). (The present work focuses only on the unimolecular quadruplexes.) Formation of G-quadruplexes in vivo is facilitated by proteins (6). Some proteins are also implicated in resolving the G-quadruplex structure (7,8).

G-quadruplex sequence motifs have been reported in telomeric, promoter and other regions of mammalian genomes. Formation of a G-quadruplex in the promoter region has been associated with transcription regulation of the c-myc oncogene and is being considered as a potential target for therapeutic purposes (9,10). Owing to the realization of their biological importance, there has recently been a tremendous amount of interest in studying G-quadruplexes. This is evident from a surge in the published literature [for reviews see (8,11)].

Although initially most of the studies focused on G-quadruplexes in the DNA, lately there have been many efforts to study G-quadruplex forming RNA (12–16). In fact, G-rich sequences capable of forming G-quadruplexes in the RNA have been implicated in a variety of important biological activities, such as mRNA turnover (6), Fragile X Mental Retardation Protein (FMRP) binding (14), translation initiation (15) as well as repression (16).

We have previously shown that a conserved auxiliary G-rich sequence (GRS) found near the polyadenylation regions can mediate efficient 3′ end processing of mammalian pre-mRNAs (17,18) by interacting with DSEF1/hnRNP H/H′ protein (19). However, hnRNP F has been shown to be a negative regulator of 3′ end processing (20). Regulated polyadenylation is an important component of differential gene expression. More than 50% of human and 32% of mouse genes are known to have alternative polyadenylation (21). An interplay among GRS-binding proteins, hnRNP H/H′ and F, helps in regulating alternative polyadenylation of immunoglobulin pre-mRNA (20) which, combined with alternative splicing, plays an important role in mouse B lymphocyte development (22).

In addition to differential gene expression, alternative splicing affects disease processes (23) and is a major source of protein diversity. More than two-thirds of human genes are thought to undergo alternative splicing (24). Members of the hnRNP H protein subfamily, which bind G-rich motifs, are known to be involved in alternative, tissue-specific, regulated splicing events (25–27). GRS motifs that are present near splice sites act as splicing regulators by interacting with hnRNP H (28). For example, binding of hnRNP H and F to G-rich tracts near the 5′ splice site favors production of the alternative pro-apoptotic Bcl-xs product (29). The regulatory G-rich motifs may be capable of forming quadruplex structures. Whether quadruplex structure directly plays a role in regulating RNA processing events requires investigation.

The majority of the mammalian poly(A) region GRS sequences that we had surveyed in our previous studies (18,19) are capable of forming unimolecular G-quadruplexes. Our preliminary analysis of ∼100 alternatively processed human transcripts has also revealed the presence of quadruplex forming sequences near alternative splice sites (30). However, a more detailed investigation into the distribution of G-quadruplex sequences near RNA processing sites requires a systematic large-scale analysis of mammalian genes. Although there have been two recent surveys of quadruplexes in the human genome (31,32), to date there has not been a comprehensive effort to study G-quadruplexes near RNA processing sites.

We have used a computational approach (30) to map putative G-quadruplex forming sequences within the transcribed regions of a large number of alternatively processed human and mouse genes. The fully annotated genomic nucleotide sequences are obtained from NCBI-based GenBank and RefSeq for computational analysis. Based on our analysis of alternatively spliced and alternatively polyadenylated human and mouse genes, we have built the GRSDB database. GRSDB provides a unique avenue for studying G-quadruplex forming sequences in the context of RNA processing sites. In addition to providing data on composition and locations of mapped quadruplexes relative to the processing sites in the pre-mRNA sequence, GRSDB offers simultaneous visual comparison of G-quadruplex distribution patterns among all the alternative RNA products of individual genes with the help of dynamically generated graphics.

Researchers interested in investigating the functional relevance of G-quadruplex structure, in particular its role in regulating the gene expression by alternative processing, will find GRSDB to be of great value. It allows a comprehensive large-scale analysis as well as detailed studies in individual genes. GRSDB is also a good resource for performing large-scale analysis of G-quadruplex sequence composition, including study of loops, in the transcribed regions.

The basic unit of study in GRSDB is the putative G-quadruplex that we have called QGRS (Quadruplex forming GRS). These sequences follow the motif GxNaGxNbGxNcGx. Here Gx refers to the group of guanines (which we will refer to as a G-group) that form a complex of x stacked G-tetrads. In the individual gene entries stored in GRSDB, x is generally 2, 3 or 4. The intervening arbitrary bases, Na, Nb and Nc, are called gaps or loops.
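The motif GxNaGxNbGxNcGx can be expressed as a regular expression. The following is a minimal sketch, not the QGRS-Mapper implementation: the function name `find_qgrs`, the loop-length bound `max_loop` and the choice to report only one candidate per start position are assumptions made for illustration.

```python
import re


def find_qgrs(seq, x=2, max_loop=7):
    """Find candidates matching Gx-Na-Gx-Nb-Gx-Nc-Gx in a nucleotide string.

    x        : number of guanines in each G-group (tetrads in the stack)
    max_loop : assumed upper bound on each loop length (not from the paper)

    Returns a list of (start, end, (a, b, c)) with 0-based, half-open
    coordinates and the three loop lengths. Only the first candidate the
    regex engine finds at each start position is reported.
    """
    g_group = "G" * x
    # Lazy quantifier: prefer the shortest loop that still allows a match.
    loop = rf"([ACGTU]{{1,{max_loop}}}?)"
    # The lookahead lets candidates that share bases be found as well.
    pattern = re.compile(
        rf"(?=({g_group}{loop}{g_group}{loop}{g_group}{loop}{g_group}))"
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        loops = (len(m.group(2)), len(m.group(3)), len(m.group(4)))
        hits.append((m.start(1), m.end(1), loops))
    return hits


if __name__ == "__main__":
    for hit in find_qgrs("GGGTTGGGTTGGGTTGGG", x=3):
        print(hit)
```

Because the pattern sits inside a lookahead, candidates starting at neighbouring positions are all reported, which is the "all QGRS" situation. A separate step is needed to reduce them to a non-overlapping set.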

Two QGRS are said to be overlapping if the positions they occupy in the nucleotide sequence overlap. By default GRSDB displays only non-overlapping sequences, but the user can choose to display all QGRS.
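One way to reduce a set of candidates to a non-overlapping subset is a greedy pass that keeps the highest-scoring candidate first. This is a sketch under assumptions: the text does not state how GRSDB chooses among overlapping QGRS, and the names `overlaps` and `non_overlapping` and the tuple layout are hypothetical.

```python
def overlaps(a, b):
    """True if two half-open intervals (start, end, ...) share a position."""
    return a[0] < b[1] and b[0] < a[1]


def non_overlapping(hits):
    """Greedily keep the best-scoring hits that do not overlap each other.

    hits: list of (start, end, score) tuples, 0-based half-open coordinates.
    Ties in score are broken by the smaller start position (an assumption).
    """
    chosen = []
    for hit in sorted(hits, key=lambda h: (-h[2], h[0])):
        if not any(overlaps(hit, kept) for kept in chosen):
            chosen.append(hit)
    return sorted(chosen)


if __name__ == "__main__":
    candidates = [(0, 20, 30), (10, 30, 42), (25, 45, 35), (50, 70, 20)]
    print(non_overlapping(candidates))
```

The greedy rule is simple but not guaranteed to maximize the total score of the retained set; a weighted interval scheduling algorithm would be the choice if that property mattered.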

GRSDB is a relational database built using MySQL. The GRSDB website can be accessed at http://bioinformatics.ramapo.edu/grsdb/. This database primarily stores information about putative G-quadruplex sequences (QGRS) for genes that are alternatively processed (either alternatively spliced or alternatively polyadenylated). GRSDB is structured to facilitate queries about alternatively processed genes and to display information on the G-quadruplex sequences contained in the transcribed regions of the gene and their locations relative to RNA processing sites. Table 1 shows the types of objects found in the database.
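To make the relational organization concrete, the sketch below builds a toy three-table layout and runs one query. Everything in it is hypothetical: the table and column names, the placeholder rows, and the use of Python's built-in `sqlite3` in place of MySQL are assumptions for illustration and do not reproduce the actual GRSDB schema.

```python
import sqlite3

# Hypothetical layout: gene -> RNA product -> QGRS. Not the real GRSDB tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_id  INTEGER PRIMARY KEY,
    symbol   TEXT NOT NULL,
    organism TEXT NOT NULL
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    gene_id    INTEGER NOT NULL REFERENCES gene(gene_id),
    label      TEXT NOT NULL
);
CREATE TABLE qgrs (
    qgrs_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    start_pos  INTEGER NOT NULL,
    length     INTEGER NOT NULL,
    g_score    INTEGER NOT NULL
);
""")

# Placeholder rows, invented purely to exercise the query below.
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                 [(1, "GENE_A", "human"), (2, "GENE_B", "mouse")])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [(1, 1, "Product 1"), (2, 1, "Product 2"), (3, 2, "Product 1")])
conn.executemany("INSERT INTO qgrs VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 120, 18, 40), (2, 2, 120, 18, 40), (3, 3, 55, 15, 21)])


def genes_with_multiple_products(connection):
    """Return (symbol, product count) for genes having more than one product."""
    return connection.execute("""
        SELECT g.symbol, COUNT(p.product_id)
        FROM gene AS g
        JOIN product AS p ON p.gene_id = g.gene_id
        GROUP BY g.gene_id
        HAVING COUNT(p.product_id) > 1
        ORDER BY g.symbol
    """).fetchall()


if __name__ == "__main__":
    print(genes_with_multiple_products(conn))
```

Storing each QGRS against a product rather than against a gene is what would allow the distribution patterns of alternative products to be compared side by side.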

GRSDB is populated using an auxiliary program, QGRS-Mapper, that is based on previously published methods (30) and was developed using BioPerl. Once appropriate genes have been identified, this program links to GenBank or RefSeq, downloads the corresponding genomic nucleotide sequence entry of the gene, and parses the entry for product, intron, exon, poly(A) and related information. The program then processes the nucleotide sequence to find all QGRS and map their location within the gene and their distance from relevant RNA processing sites.

A scoring method is applied to each QGRS. The computed score, called a G-score (30), is formulated to reward sequences with smaller, more even gaps between the G-groups in addition to larger G-group size, thereby favoring the arrangement that is more likely to form a unimolecular complex. This choice of scoring system is in agreement with the existing literature on loop structures in G-quadruplexes (31–35). In particular, the data gathered in this research point to loop sizes tending to be small and preferentially equal or nearly equal.
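The qualitative behaviour described for the G-score can be illustrated with a toy function. This is explicitly not the published G-score formula, which is not given in this text: the name `toy_score`, the weight of 10 per tetrad and the two penalty terms are assumptions chosen only so that the score rises with G-group size and falls as loops grow longer or more uneven.

```python
def toy_score(x, loops):
    """Illustrative stand-in for a G-score (NOT the published formula).

    x     : number of guanines per G-group
    loops : the three loop lengths (a, b, c)

    Rewards larger G-groups, penalizes total loop length, and penalizes
    the spread between the longest and shortest loop.
    """
    a, b, c = loops
    size_reward = 10 * x                 # assumed weight per tetrad
    length_penalty = a + b + c           # shorter loops score higher
    unevenness_penalty = max(loops) - min(loops)  # equal loops score higher
    return size_reward - length_penalty - unevenness_penalty


if __name__ == "__main__":
    print(toy_score(3, (2, 2, 2)), toy_score(3, (1, 2, 3)))
```

With the same total loop length of 6, the even arrangement (2, 2, 2) outscores the uneven (1, 2, 3), mirroring the preference for equal or nearly equal loops.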

The data flow for GRSDB is summarized in Table 2. After the gene information is downloaded from NCBI, parsed, processed for QGRS, and scored, it is then uploaded into GRSDB. At this point the database is ready for user queries. There are three different interfaces provided for viewing database contents: the gene view, the data view and the graphical view.

Database users are given a variety of options in formulating a query, including searching for genes that are alternatively spliced or alternatively polyadenylated. Once a query has been entered, a table is displayed of all genes satisfying the query. Information for individual genes is displayed in a table as shown in Figure 2 for the particular gene MUCDHL. This is what we call the gene view. One can see that MUCDHL is both alternatively spliced and alternatively polyadenylated.

At this point the user can choose to analyze one of the products or all products simultaneously. Two types of analysis are possible: the data view and the graphical view. Figure 3 represents the data view analysis for Product 1 of MUCDHL, showing all non-overlapping QGRS. The table shows the location of each QGRS, its distance from the nearest splice site, and its G-score.

Alternatively, the user can select the graphical view of any product (or again, all products together), which is shown in Figure 4. A visual model of the product is displayed, showing the location of exons, introns and untranscribed regions. Further, the location of each QGRS is indicated by a vertical line. The length of the line is proportional to the G-score for the sequence.

GRSDB provides curated information on composition and distribution of putative QGRSs in the transcribed regions of alternatively processed human and mouse genes. The data are based on the analysis of fully annotated GenBank/RefSeq human and mouse genomic nucleotide entries that exhibit alternative processing information. Although the NCBI databases contain a large number of mRNA sequence records, at present the number of genomic entries that will provide information needed for our studies is limited.

At present, our database contains information obtained from 1310 human and mouse genes, of which 1188 are alternatively processed. A total of 30 584 introns and 33 816 exons were analyzed, containing a total of 3231 RNA products. These products taken together contain a total of 379 223 putative G-quadruplexes, of which 54 252 are near RNA processing sites [within 120 nt of a splice site or a poly(A) signal]. Note that while GRSDB currently contains data only on human and mouse genes, our computational tools and the database are designed to include other organisms as well.
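The 120 nt proximity criterion can be sketched as a nearest-site lookup. The sketch rests on assumptions: distance is measured here from the nearest end of the QGRS to the site position, which the text does not specify, and the function names and coordinates are hypothetical.

```python
from bisect import bisect_left


def distance_to_nearest_site(start, end, sites):
    """Distance in nt from a QGRS at [start, end) to the nearest site.

    sites must be a sorted list of positions (splice sites or poly(A)
    signals). Returns 0 if a site lies inside the QGRS and None if the
    list of sites is empty.
    """
    i = bisect_left(sites, start)  # first site at or after `start`
    best = None
    for j in (i - 1, i):           # nearest site on the left and on the right
        if 0 <= j < len(sites):
            site = sites[j]
            if site < start:
                d = start - site
            elif site >= end:
                d = site - (end - 1)
            else:
                d = 0
            best = d if best is None else min(best, d)
    return best


def count_near(qgrs_list, sites, window=120):
    """Count QGRS whose nearest site is within `window` nt."""
    total = 0
    for start, end in qgrs_list:
        d = distance_to_nearest_site(start, end, sites)
        if d is not None and d <= window:
            total += 1
    return total


if __name__ == "__main__":
    sites = [100, 500, 900]
    print(count_near([(150, 170), (300, 320), (490, 510)], sites))
```

Keeping the sites sorted makes each lookup logarithmic in the number of sites, which matters when many QGRS are checked against many introns and exons.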

GRSDB is continuously being updated with new data entries. The database is structured to facilitate a wide variety of queries and to map G-quadruplex sequences relative to the RNA processing sites in both data and graphic formats. The user friendly interface allows comparisons of all the alternative RNA products of individual genes on the same screen. GRSDB is a good resource for researchers interested in investigating the functional relevance of G-quadruplexes, especially in the context of alternative RNA processing.

We are using the database to conduct detailed bioinformatics studies on the distribution patterns of QGRS near RNA processing sites. In particular, we are investigating whether there is a correlation between the distribution pattern of QGRS and alternative processing. Our group is also studying the loop composition of these sequences.

Supplementary Data are available at NAR online.