
A novel approach to describe a U1 snRNA binding site

RNA duplex formation between U1 snRNA and a splice donor (SD) site can protect pre-mRNA from degradation prior to splicing and initiates formation of the spliceosome. This process was monitored, using sub-genomic HIV-1 expression vectors, by expression analysis of the glycoprotein env, whose formation critically depends on functional SD4. We systematically derived a hydrogen bond model for the complementarity between the free 5′ end of U1 snRNA and 5′ splice sites and numerous mutations following transient transfection of HeLa-T4+ cells with 5′ splice site mutated vectors. The resulting model takes into account number, interdependence and neighborhood relationships of predicted hydrogen bond formation in a region spanning the three most 3′ base pairs of the exon (–3 to –1) and the eight most 5′ base pairs of the intron (+1 to +8). The model is represented by an algorithm classifying U1 snRNA binding sites which can or cannot functionally substitute SD4 with respect to Rev-mediated env expression. In a data set of 5′ splice site mutations of the human ATM gene we found a significant correlation between the algorithmic classification and exon skipping (P = 0.018, χ2-test), showing that the applicability of the proposed model reaches far beyond HIV-1 splicing. However, the algorithmic classification must not be taken as an absolute measure of SD usage as it may be modified by upstream sequence elements. Upstream to SD4 we identified a fragment supporting ASF/SF2 binding. Mutating GAR nucleotide repeats within this site decreased the SD4-dependent Rev-mediated env expression, which could be balanced simply by artificially increasing the complementarity of SD4.

Within eukaryotic cells, the formation of an RNA duplex between the free 5′ end of U1 snRNA and its target sequence in pre-mRNA stimulates the assembly of the spliceosome that catalyzes splicing. Prior to intron removal, binding of U1 snRNP, at least in some instances, protects pre-mRNA against nuclear degradation (1–3). Recently, it has been shown that functional 5′ splice sites from randomized intron sequences can be selected through repetitive rounds of in vitro splicing in HeLa cell nuclear extracts containing U1 snRNA with a deleted 5′ end (4). This finding does not controvert the original observation that base pairing between U1 and the 5′ splice site is necessary for the splicing of mRNA precursors in vivo (5). Rather, protein interactions between U1 snRNP and the 5′ splice site may additionally contribute to splice donor (SD) recognition (6–8). Consistent with this concept is the observation, made upon compiling sequences of exon–intron borders of mammalian 5′ splice sites (9–12), that metazoan splice sites are only poorly conserved. Nevertheless, a 5′ splice site consensus sequence can be derived from the highest frequency with which distinct nucleotides occur in a given position in such a data set, viz. AG/GURAGU (where R = purine, and / indicates the exon–intron border) (13). This sequence is complementary to the nucleotides from positions 3 to 10 of the 5′ end of U1 snRNA. However, the significance of such a consensus sequence remains arguable. Numerous approaches that have been suggested for the prediction of 5′ splice sites and that are exclusively represented by nucleotide frequency-based consensus matrices (11,14–16) fail to be highly reliable. An improvement in the prediction of 5′ splice sites has been achieved by considering the influence of neighboring bases (17–21) and the mutual dependence of nucleotides at different site positions (22).

Beyond the nucleotide sequence of the target, binding of U1 snRNP is facilitated by protein components. These include the U1-specific proteins, U1 C (23, 24) and U1 70K (24,25). The latter is bound to loop I of U1 snRNA (26) by its arginine/serine-like domain (27). Other proteins of the SR protein family (serine/arginine-rich proteins) are also involved in 5′ splice site selection, and in the commitment of different pre-mRNAs to the splicing pathway (28). Characteristic for the SR protein family are their RNA binding domain(s) in the N-terminus conferring substrate specificity, and their arginine/serine-rich (RS) domain at the C-terminus, which has been shown to be interchangeable for some SR proteins without affecting substrate specificity (29).

SR protein-dependent splicing is determined by binding of SR proteins to distinct sequence elements, known as exonic or intronic splicing enhancers (ESEs or ISEs, respectively) and exonic or intronic splicing silencers (ESSs or ISSs, respectively), within the upstream (30) or downstream exon (29) or within the intron (31). Moreover, depending on their position on the pre-mRNA, SR proteins can function as activators or repressors of splicing (32). Based on SELEX protocols, short consensus binding sites of 6–10 nt in length were obtained for more than half of the known human SR proteins (33). However, similarity of any sequence to such a consensus motif does not necessarily imply its functional significance (34). Our currently insufficient understanding of the context dependency of ESEs’ and ESSs’ activities (29,35), in particular in relation to the strength of the neighboring splice site (36), complicates any reliable description of and prediction for cognate splice sites and for the binding of U1 snRNP and U2AF to the 5′ and 3′ splice sites, respectively.

In a recent mutational analysis of the HIV-1 5′ splice site SD4 we have shown that a continuous stretch of at least 14 hydrogen bonds mediated envelope glycoprotein (env) mRNA stabilization resulting in unimpaired env expression (3). Mismatches to the U1 snRNA sequence did not decrease env expression, if flanked by a sufficient number of hydrogen bonds.

As an extension of this previous work, we here propose a hydrogen bond model describing the interaction of U1 snRNA and its binding sites. We suggest hydrogen bonding patterns to describe U1 snRNA binding more accurately than the mere similarity to the consensus sequence. In particular, we take into account that for a G:U base pair different hydrogen bonding patterns are possible (37). We limited our study to GT containing U1 snRNA binding sites, since it has been estimated that only ∼3.7% of annotated splice sites do not contain the canonical dinucleotides GT and AG for donor and acceptor sites (38), respectively. We examined our hydrogen bond model of splice site description with 5′ splice site mutations of the human ATM gene containing 66 exons, and we found its applicability confirmed with respect to exon skipping, beyond its authentic context of HIV-1.

Oligonucleotides were synthesized and purified as previously described (39). The oligonucleotides are as follows: 1020 (no. 15): 5′-CATCCTTAAGCTTCTCTATCAAAGGGGTGGAAATTACATGTAATGCA. For the following primers, which are identical to 1020 except for the underlined sequence, only these nucleotides are shown: 1226 (no. 16), ATTGTGGGTCA; 1227 (no. 17), TTGGTAAATGT; 1228 (no. 18), TTGGTAGGTTT; 1264 (4G), GCAGTAGGTAG; 1329 (cs-314), CAGGTGTCATG; 1304 (3G7C), GCAGTGAGTCG; 1305 (4G7C), GCAGTAGGTCG; 1332 (-1G3U 7G8U), GCGGTTAGTGT; 1348 (SD2), AAGGTGAAGGG; 884 (-1G3U7U), 5′-CATCAAGCTTCTCTATCAAAGCGGTTAGTTGTACATGTAATGCA. For the following primers, which are identical to 884 except for the underlined sequence, only these nucleotides are shown: 544 (4A), GCAGTAAGTAG; 900 (4U), GCAGTATGTAG; 1184 (U1 5G), 5′-CGAAGATCTCATACGTACCTGGCAGGGGAGAT. For the following primers, which are identical to 1184 except for the underlined sequence, only these nucleotides are shown: 1185 (U1 5A), ATACATACCTG; 1204 (U1 6C), ATACTCACCTG; 1207 (U1 9T10G11C), ATACTTACTGC; 1222 (U1 3T4T10C11C), ATTTTTACCCC; 1230 (U1 5C10C11C), ATACCTACCCC; 1095, 5′-GGGCCTAGGAATTCTCTCTTGAGCTCGCAGTAAGTAGCTTAAGCTCTCCGAAGACAGTGGCAATGAGAGT; 1098, 5′-AATTCAGGAAGAAG CGGAGACAGCGACGAAGAGCT; 1099, 5′-GTCCTTCTTCGCCTCTGTCGCTGCTTC. For detection of double-spliced and skipped RNA the primers 1544 (sense), 5′-CTTGAAAGCGAAAGTAAAGC, and 1542 (antisense), 5′-CACCTTCTTCTTCTATTCCTT, were used. For the detection of the human growth hormone (hGH) mRNA primers 1225 (sense), 5′-CAACAGAAATCCAACCTAGAGCTGCT, and 1224 (antisense), 5′-TCTTCCAGCCTCCCATCAGCGTTTGG, were used.

The SV40 early env expression vector SV E/X tat rev carrying the EcoRI–XhoI fragment of pNLA1 (40) and the 5′ splice site mutations were constructed as previously described (3). SVcrev was constructed by recloning the EcoRI–XhoI fragment from pUHcrev (41) into pSVT7.

For the construction of plasmid pUCBΔU1 the BglII–PstI fragment (nucleotides 440–726) of the parent plasmid pUCBU1 (kindly provided by A. Weiner) was substituted for a linker (1132/1133) containing a unique BglII, PstI and XhoI site. For the mutations, the BglII–XhoI fragment (nucleotides 440–464) of pUCBΔU1 was substituted for a PCR-amplified fragment with a 5′ PCR primer carrying the BglII site and the desired mutation, e.g. 1204 (U1 6C), and the 3′ PCR primer 1131.

Plasmids pBSU1 and pBSU2b used for U1 snRNA and U2 snRNA detection contain U1 snRNA or U2 snRNA sequences in the antisense orientation. Plasmid pDS-H6F1-SF2ΔRS was used for purification of recombinant SF2 protein with a deletion of the RS domain.

pSV-1-env SD2, pSV-1-env –1G3U, pSV-1-env 4G7C and pSV-1-env 4C8U were constructed by substituting the 5′ splice site no. 4 containing SacI–NheI fragment of pSV-1-env with the respective SacI–NheI fragments from the SV E/X tat rev constructs including the respective splice site mutation (3).

All PCR-amplified sequences were confirmed by DNA sequencing. The sequences of all constructs are available on request.

Under our experimental conditions we could not observe any influence of the co-expression of the Env-receptor on the levels of Env protein expression. In order to be able to microscopically observe the formation of syncytia in the course of the transfection experiment, we transfected (3) HeLa-T4+ cells (42) with FuGENE™ 6 (Roche Molecular Biochemicals), rather than using HeLa cells. The transfection efficiency was biochemically monitored by cotransfection of pGL3-control (Promega). The medium was changed after 24 h, and cells were harvested and prepared 48 h after transfection. For RT–PCR cells were transfected with 1 µg of the respective pSV-1-env plasmid and 1 µg of pXGH5 (43). Transfection was performed by using 4 µl of FuGENE™ 6 and total RNA was isolated after 30 h.

Cells were scraped from six-well plates into the medium, sedimented at 12 000 g for 14 s, washed twice in PBS and suspended in 200 µl of SDS–polyacrylamide gel electrophoresis sample buffer (44). An aliquot of the PBS washing step was analyzed for luciferase activity (Luciferase Assay System; Promega). Protein concentrations were determined by the Bradford Protein Assay (Bio-Rad). All samples were adjusted to equal levels of luciferase activity and similar protein concentrations by adding extracts of mock-transfected cells. Samples were subjected to electrophoresis on an SDS–7% polyacrylamide gel and transferred to a PVDF membrane (Immobilon™ P, pore size 0.45 µm; Millipore) by electroblotting with 70 V in transfer buffer (200 mM glycine, 25 mM Tris, 20% methanol) for 1 h. Blots were blocked in PBS with 10% bovine serum albumin (BSA), 10% Tween®-20, for 20 h. For protein detection, blots were incubated in PBS, 1% BSA, 1% Tween®-20, with a monoclonal mouse-anti-gp120 antibody (87-133/026, 1:5000, kindly provided by Dade Behring) for 1 h, washed four times, incubated with a sheep-anti-mouse antibody conjugated with horse-radish peroxidase (NA 931, Amersham, 1:1000), washed four times, rinsed with water and visualized by a chemiluminescence detection system (ECL™-system and ECL™ hyperfilm, Amersham; Super Signal® ultra, Pierce).

Biotinylated in vitro transcripts of AflIII-linearized SV E/X tat rev plasmids carrying mutations of the tat/rev 5′ splice site were prepared using T7-MEGAshortscript™ (Ambion), biotin-16-uridine-5′-triphosphate (Roche Molecular Biochemicals) and diluted [32P]UTP as tracer for concentration determination. Complex formation was performed in a 200 µl reaction volume containing 70 µl of HeLa nuclear extract (Cell Culture Center, Belgium), 30 µl buffer D (20 mM HEPES–KOH pH 7.9, 100 mM KCl, 20% glycerol, 0.2 mM EDTA, 0.5 mM DTT), 20 µl of 10 mM ATP, 20 µl of 50 mM creatine phosphate, 20 µl of 25 mM MgCl2 and 1 µg of biotinylated transcripts by incubation at 30°C for 20 min. The Dynabeads® M-280 Streptavidin (Dynal) were preblocked in NET2 buffer (50 mM Tris–HCl, 0.05% Nonidet P-40, 0.5 mM DTT) containing 0.5 mg/ml tRNA (Escherichia coli MRE 600; Roche Molecular Biochemicals) and 0.2 mg/ml BSA, incubated with the reaction volume at 4°C for 1 h, and washed four times in 1 ml of washing buffer (0.05% Nonidet P-40, 500 mM KCl, 20 mM HEPES–KOH pH 7.6, 1 mM MgCl2, 0.5 mM DTT, 0.1 mM EDTA) over a total of 2 h. The complexed RNAs were eluted by incubating the beads in 300 µl of denaturation buffer (20 mM Tris–HCl pH 7.5, 4 M urea, 0.5% SDS, 10 mM EDTA, 0.3 M NaCl) at 37°C for 5 min. After phenol–chloroform extraction and ethanol precipitation, samples were resuspended in formamide loading buffer and incubated at 95°C for 30 s, followed by fractionation on an 8% denaturing polyacrylamide gel. The RNA was blotted onto a positively charged nylon membrane (Roche Molecular Biochemicals) as previously described (3). The complexed snRNAs were identified by probing the membrane with [32P]UTP-labeled antisense transcripts of U1 snRNA and U2 snRNA. Detection of U2 snRNA served as a control for the pull-down assay.

Isolation of total RNA was performed using a modified guanidinium isothiocyanate protocol (45). Cells were washed twice with 2 ml of PBS and cell lysis was performed with 500 µl of buffer D [7.6 µl of 2-mercaptoethanol, 50 µl of 3 M sodium acetate (pH 4)]; 500 µl of phenol and 100 µl of a chloroform–isoamyl alcohol mixture (24:1) were added and mixed for 15 s. After incubation on ice for 15 min, phases were separated by centrifugation (10 600 g, 4°C, 20 min, Eppendorf centrifuge). RNA was precipitated in 1 vol of isopropanol overnight. After centrifugation (10 600 g, 4°C, 20 min, Eppendorf centrifuge) RNA was washed twice with 70% ethanol and dissolved in 10 µl of DMDC-ddH2O.

Prior to RT, 4 µl of RNA samples were subjected to DNAse I digestion using 10 U DNAse I (Roche Molecular Biochemicals) with 50 mM Tris pH 7.5 and 10 mM MgCl2 in a total volume of 10 µl at room temperature for 1 h 45 min. After DNAse I inactivation at 95°C for 10 min, 4.5 µl of the DNAse-digested RNA samples were reverse transcribed with 200 U SuperScript III RNAse H Reverse Transcriptase (Invitrogen) according to the manufacturer’s protocol using 0.375 mM oligo(dT)15 (Roche Molecular Biochemicals) as primer. As a negative control for remaining plasmid DNA contamination of each sample, a second assay was performed as described above but with ddH2O in place of reverse transcriptase. PCR was carried out with 1.25 U AmpliTaq (Applied Biosystems) in a total volume of 50 µl according to the manufacturer’s protocol in a Robocycler Gradient 96 Temperature Cycler (Stratagene). All primers were used at a final concentration of 0.2 µM. Double-spliced and skipped RNA was detected with the primer pair 1544/1542 and hGH mRNA with primer pair 1225/1224. Prior to PCR the cDNA reaction mixture was denatured at 94°C for 3 min. To determine the linear PCR-amplification range allowing a semi-quantitative estimation of the relative abundance of pSV-1-env and hGH mRNA, a preliminary PCR test series was carried out using the same cDNA sample but varying the PCR cycle numbers between 15 and 30 [94°C, 0.5 min; 52°C (pSV-1-env) and 56°C (hGH), respectively, 1 min; 72°C, 1 min]. The reactions were completed with a final elongation step at 72°C for 10 min to extend prematurely terminated products. According to the results obtained, PCR analysis was performed with 26 cycles for both pSV-1-env and hGH PCR amplification.

PCR products were separated on 6% non-denaturing polyacrylamide gels, stained with ethidium bromide (10 min) and visualized with the Lumi-Imager F1 (Roche Molecular Biochemicals).

[32P]UTP-labeled in vitro transcripts (30 000 c.p.m.) made from AflII-linearized plasmids with (SV GAR Δvpu env) and without (SV ΔGAR Δvpu env) the GAR motifs (trinucleotide sequence with R = A or G) were incubated at room temperature for 15 min in a 10 µl reaction volume containing 5 µl of 2× band-shift buffer (20 mM HEPES–KOH pH 7.6, 200 mM KCl, 4 mM MgCl2, 1 mM EDTA, 2 mM DTT, 20% glycerol, 100 ng/µl tRNA), 0.1 µl of RNasin (Promega) and 1 µl of SF2ΔRS (0–1000 ng) diluted in SF2 buffer (0.5 M GuHCl, 20 mM HEPES–KOH pH > 7.9, 42 mM (NH4)2SO4, 0.5 mM DTT, 0.2 mM EDTA, 15% glycerol, 1 mM PMSF), and were then analyzed on a native 5% polyacrylamide gel. Gels were subjected to autoradiography.

Recently, we have shown that inherently unstable HIV-1 env pre-mRNA can be protected from degradation by RNA duplex formation between the HIV-1 5′ splice site SD4 and U1 snRNA (Fig. 1A and C) (3). To determine what degree of sequence complementarity is necessary for the 5′ splice site to bind U1 snRNP in vitro, we compared U1 snRNP binding to SD4 (GCAGTAAGTAG) and three mutant 5′ splice sites, 4C8U, 3U and SD4. We used a pull-down assay with biotinylated in vitro transcripts comprising upstream sequences of SD4, the different splice-site mutations and a short portion of the intron sequence. Following incubation of these transcripts with HeLa nuclear extracts, total RNA was extracted from the formed complexes and analyzed by northern blotting for the presence of U1 snRNA (Fig. 1B). Complexes containing significant amounts of U1 snRNA were formed only with the authentic 5′ splice site SD4, whereas the mutant SD4 transcripts, even if they carried just one single mismatch, failed to bind appreciable amounts of U1 snRNP (Fig. 1B). This result confirmed our previous observation that, in the case of a discontinuous stretch of hybridization potential, fewer than 15 hydrogen bonds were insufficient for RNA duplex formation between the HIV-1 5′ splice site SD4 and U1 snRNA. Sensitivity differences between the in vitro and in vivo assays (Fig. 1B and C) could have caused the slightly different result for mutation 4C8U with its intermediate glycoprotein expression. As an additional interesting result, this experiment showed that U2 snRNP binding occurred even in the absence of U1 snRNP binding.

In order to derive a hydrogen bonding model of general significance from the study of interactions between SDs and U1 snRNA, we sought to study alternatives to SD4.

Based on our preliminary hydrogen bonding model derived by Kammler et al. (3), we looked for potential, yet unidentified, U1 snRNA binding sites within the HIV-1 env transcript that would eventually function to stabilize the pre-mRNA. To this end, we scrutinized the env gene between the translational start codon and the splice acceptor SA7 to predict binding sites. For all 11 nt long sequences containing a GT dinucleotide, we counted the number of hydrogen bonds (NH) in a continuous or discontinuous stretch of neighboring base pairs. An A:U base pair counted for two and a G:C base pair for three hydrogen bonds. From our previous western blot analyses we classified hydrogen bonding patterns as ‘high complementarity’ (HC) if they ensured env expression and ‘low complementarity’ (LC) otherwise. A hydrogen bonding pattern qualified as HC, if NH ≥ 14 for a continuous stretch and NH ≥ 15 for a discontinuous stretch, with the additional constraint that for a discontinuous stretch 5′ to the mismatch NH ≥ 6 and 3′ to the mismatch NH ≥ 5. Beyond the Watson–Crick A:U base pairs analyzed in our previous study, we assumed G:U wobbles to form two hydrogen bonds. Tentatively applying the so defined algorithm, we found four potential U1 snRNA binding sites (designated nos 15–18) within this part of the HIV-1NL genome which so far had not been identified as functional 5′ splice sites. Of note, all four potential U1 snRNA binding sites contained at least one G:U base pair. To examine whether these putative U1 snRNA binding sites enabled 5′ splice site-dependent Rev-mediated env expression, we substituted SD4 with each of them in our env expression vector. Unexpectedly, only one, no. 18, mediated normal env expression; the other three (nos 15–17) showed very little, if any, env expression (Fig. 2).
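The preliminary counting rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the U1 snRNA 5′ end is taken as the wild-type form of the U1 primers listed in the oligonucleotide section, and a 'discontinuous stretch' is assumed to mean two runs of neighboring base pairs separated by exactly one mismatch, with 5′ and 3′ read in the orientation of the splice site.

```python
# Sketch of the preliminary hydrogen bond count (assumed details are marked).
U1_5PRIME = "AUACUUACCUG"  # U1 snRNA nt 1-11, 5'->3' (wild-type form of the U1 primers)

# Hydrogen bonds per base pair as stated in the text; G:U = 2 is the
# preliminary assumption that the later experiments call into question.
BONDS = {("A", "U"): 2, ("U", "A"): 2, ("G", "C"): 3, ("C", "G"): 3,
         ("G", "U"): 2, ("U", "G"): 2}


def bond_profile(site):
    """Hydrogen bonds for positions -3..+8 of an 11 nt site (DNA or RNA letters)."""
    rna = site.upper().replace("T", "U")
    if len(rna) != 11:
        raise ValueError("expected an 11 nt site spanning -3 to +8")
    # Antiparallel duplex: site position -3 faces U1 nt 11, +8 faces U1 nt 1.
    return [BONDS.get((s, u), 0) for s, u in zip(rna, reversed(U1_5PRIME))]


def runs(profile):
    """Split the profile into (start, end, NH) runs of neighboring base pairs."""
    out, start = [], None
    for i, bonds in enumerate(profile + [0]):  # trailing 0 closes the last run
        if bonds and start is None:
            start = i
        elif not bonds and start is not None:
            out.append((start, i - 1, sum(profile[start:i])))
            start = None
    return out


def classify(site):
    """Return 'HC' or 'LC' under the preliminary thresholds quoted above."""
    r = runs(bond_profile(site))
    if any(nh >= 14 for _, _, nh in r):
        return "HC"  # continuous stretch with NH >= 14
    # Assumption: 'discontinuous' = two runs separated by a single mismatch.
    for (_, end5, nh5), (start3, _, nh3) in zip(r, r[1:]):
        if start3 - end5 == 2 and nh5 >= 6 and nh3 >= 5 and nh5 + nh3 >= 15:
            return "HC"
    return "LC"


print(classify("GCAGTAAGTAG"))  # SD4
print(classify("TTGGTAGGTTT"))  # predicted site no. 18
```

Under these assumptions SD4 and the predicted sites nos 16–18 all come out as HC, which mirrors the prediction step in the text; the sketch says nothing about which of them actually supports env expression.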

To examine whether env expression mediated by the predicted binding sites 15–17 could be restored by increasing their complementarity to U1 snRNA, we repeated the transfection experiment with the env expression vector carrying site 15 and expression vectors for wild-type U1 snRNA and three different site-mutated U1 snRNAs, normally acting as suppressors. In this case, they served as probes that might increase complementarity to the predicted binding site 15. As already seen in Figure 2, the putative U1 snRNA binding site 18 supported env expression with wild-type U1 snRNA, whereas site 15 did not (Fig. 3). Likewise, site 15 did not support detectable env expression following cotransfection with the mutated U1 snRNAs 6C or 3U4U10C11C, whereas with the site-mutated U1 snRNA 5C10C11C some env expression was detected (Fig. 3). The assumption of a G:U wobble always contributing two hydrogen bonds to the duplex formation could not adequately describe these experimental results.

This consideration led us to explore other possibilities. In the duplex r(GGCGUGCC)2, the G:U wobble forms only one hydrogen bond instead of the expected two for the canonical wobble pair (37). Assuming this one-hydrogen-bond model for G:U base pairing in our context, only sites 17 and 18 would have been expected to successfully substitute for SD4 (Fig. 2, column ‘G:U (1)’). However, predicted U1 binding site 17 had been found not to support env expression (Fig. 2), and was thus studied further. In addition to site 17, two site-directed mutants, –1G5U and –1G5U8U, were generated and tested. –1G5U and –1G5U8U contained a continuous stretch of 12 hydrogen bonds identical to site 17 in positions –1 to +5, but had different base pairs in positions +7 and +8. Even if site 17 could form only one hydrogen bond at the G:U wobble in position +7, this assumption would result in a higher total number of hydrogen bonds than site –1G5U. env expression, however, was extremely low upon transfection of HeLa-T4+ cells with the expression vector carrying site 17 (Fig. 4). In particular, it was lower than with –1G5U. From these results, we cautiously concluded that a G:U base pair in position +7, flanked on either side by no more than two hydrogen bonds, barely contributed to the stability of the RNA duplex. Similar results were obtained analyzing mutants with a G:U base pair in position –2 (data not shown), at the other pre-terminal position of RNA duplex formation.

Next, we looked at the behavior of internal G:U wobbles as regards the applicability of the one-hydrogen-bond hypothesis. We examined the contribution of a G:U wobble in position +4 of SD4 to the 5′ splice site-dependent Rev-mediated env expression. SD4 has an A in this position pairing with the U of U1 snRNA. By mutating this position we were able to analyze the effect of a G:U wobble that was flanked by seven hydrogen bonds on either side on RNA duplex formation (Fig. 5). Any hydrogen bond formation within this mutated position +4 would have converted the discontinuous stretch into a continuous one with NH > 14 hydrogen bonds. Substitution of the A in +4 with any of the three other possible nucleotides, however, did not lead to a comparable env expression. Therefore, we concluded that a single G:U wobble in +4 does not contribute to RNA duplex formation even if it is flanked by seven Watson–Crick hydrogen bonds on both sides.

We would not have expected these results on thermodynamic grounds employing two algorithms, Dynalign for finding secondary structures common to two RNA sequences (46) and HyTher, a nearest-neighbor model that predicts hybridization thermodynamics of a given duplex (47). Both algorithms predicted only little difference for the duplex free energies with an A:U base pair or a G:U wobble [ΔG° 0.3 and 0.35 kcal/mol (Dynalign and HyTher), respectively] in contrast to a greater difference for the C mismatch [ΔG° 4.1 and 4.05 kcal/mol (Dynalign and HyTher), respectively].

To confirm that in this sequence the G:U wobble at +4 does not contribute to RNA duplex formation, whereas base substitutions that pair by hydrogen bonds do so, cotransfections with site-mutated U1 snRNAs were performed. As shown in Figure 6, env expression of the SD4 mutant 4C was restored only in the presence of the matching site-mutated U1 snRNA, pUCBU1 5G, while the same U1 snRNA failed to restore env expression of the SD4 mutant 4U. env expression of SD4 mutant 4U, on the other hand, was specifically restored in the presence of the matching site-mutated U1 snRNA, pUCBU1 5A, not by pUCBU1 5G.
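The logic of this compensation experiment can be traced with a short sketch that counts only Watson–Crick hydrogen bonds, i.e. it treats a G:U wobble at +4 as contributing nothing, which is the hypothesis under test. The U1 variants 5G and 5A and the site 4U are taken from the primer list; the 4C sequence is derived from SD4 by the single substitution its name implies; the function and variable names are hypothetical.

```python
# Sketch: longest continuous Watson-Crick stretch for SD4 variants paired
# with wild-type or site-mutated U1 snRNA 5' ends (nt 1-11, 5'->3').
U1 = {"wt": "AUACUUACCUG", "5G": "AUACGUACCUG", "5A": "AUACAUACCUG"}
SITES = {"SD4": "GCAGUAAGUAG", "4C": "GCAGUACGUAG", "4U": "GCAGUAUGUAG"}

# Only Watson-Crick pairs are counted; G:U wobbles score zero here.
WC = {("A", "U"): 2, ("U", "A"): 2, ("G", "C"): 3, ("C", "G"): 3}


def longest_stretch(site, u1):
    """Largest number of hydrogen bonds in a run of neighboring base pairs."""
    best = current = 0
    # Antiparallel pairing: site position -3 faces U1 nt 11.
    for pair in zip(site, reversed(u1)):
        current = current + WC[pair] if pair in WC else 0
        best = max(best, current)
    return best


def stretch_table():
    """Hydrogen bond count for every site / U1 combination."""
    return {(s, u): longest_stretch(SITES[s], U1[u]) for s in SITES for u in U1}


for (site_name, u1_name), nh in stretch_table().items():
    print(f"{site_name:>3} + U1 {u1_name}: NH = {nh}")
```

In this count only the matched combinations (SD4 with wild-type U1, 4C with U1 5G, 4U with U1 5A) reach a continuous stretch of at least 14 hydrogen bonds, whereas 4U with U1 5G, which would pair through a wobble, stays at 7. That pattern agrees with the restoration of env expression reported for Figure 6.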

This experiment convincingly showed that in this sequence a G:U wobble in +4 does not contribute to RNA duplex formation supporting 5′ splice site-dependent Rev-mediated env expression. However, it did not rule out the possibility that a G:U base pair in position +4 could contribute to the RNA duplex stability if it was flanked by a G:U base pair in position +3. Consequently, the algorithm did not count any hydrogen bonds for a single G:U base pair in position +4.

Next we set out to study the contribution of a G:U wobble at position +3 to env expression and its dependency on the presence of flanking hydrogen bonds. We used two additional SD4 mutants containing a G:U wobble at position +3. In a continuous stretch of six complementary nucleotides, from –3 to +3, a G:U wobble at +3, terminal to the RNA duplex, did not enable env expression (Fig. 7, see cs –314 versus cs –315) and behaved like cs –313 that has a mismatch t at +3. However, if this G in +3 was flanked on both sides by complementary Watson–Crick base pairs, i.e. was no longer terminal, then the G:U wobble contributed to the env expression, albeit to a lesser extent than an A:U base pair (Fig. 7, see 3G7C and cs –315).

To account for these experimental results, we assumed for our algorithm that any G:U wobble terminal to an RNA duplex does not contribute hydrogen bonds to the RNA duplex, in general accordance with the thermodynamic evidence that lack of stacking interactions at helix termini reduces the energy contribution of terminal base pairs. As a quantitative approach to this problem, we calculated the free energies of these mutant duplexes by Dynalign and HyTher. Unfortunately, both programs failed to correlate the free energy of these RNA duplexes with the efficiency of the 5′ splice site-dependent Rev-mediated env expression (Fig. 7, columns ΔG°).

A remaining inconsistency in describing our experimental results by the suggested hydrogen bond model becomes apparent in the mutant sites cs –313, cs –314 and 3G7C. All three are predicted to form a continuous stretch of 13 hydrogen bonds. Assuming a terminal base pair—as here a G:C at position –3—to contribute one hydrogen bond less, assigns 12 hydrogen bonds for cs –313 and cs –314, but leaves 3G7C unchanged with 13 bonds. Calculated this way, a threshold of NH ≥ 13 for a continuous stretch of hydrogen bonds was found to correctly describe env expression in all our experiments.
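The adjusted counts quoted in this paragraph can be reproduced with a small sketch under explicitly stated assumptions: a Watson–Crick pair at position –3 counts one bond less; a G:U wobble at +3 counts one bond only when both neighbors are Watson–Crick pairs (a value inferred here from the total of 13 given for 3G7C, not stated by the authors); all other wobbles are set to zero. The names are hypothetical, discontinuous stretches are not modeled, and the sketch is not the published algorithm.

```python
# Sketch of the adjusted count for continuous stretches (assumptions in comments).
U1_5PRIME = "AUACUUACCUG"  # U1 snRNA nt 1-11, 5'->3'
WC = {("A", "U"): 2, ("U", "A"): 2, ("G", "C"): 3, ("C", "G"): 3}
WOBBLE = {("G", "U"), ("U", "G")}
PLUS3 = 5  # list index of position +3 (index 0 is position -3)


def refined_profile(site):
    """Per-position hydrogen bonds for an 11 nt site under the adjusted rules."""
    rna = site.upper().replace("T", "U")
    pairs = list(zip(rna, reversed(U1_5PRIME)))
    profile = [WC.get(p, 0) for p in pairs]  # wobbles start at zero
    # A Watson-Crick pair at position -3 contributes one bond less.
    if profile[0]:
        profile[0] -= 1
    # Assumption: an internal G:U at +3 contributes one bond when flanked by
    # Watson-Crick pairs on both sides; a terminal wobble contributes none.
    if pairs[PLUS3] in WOBBLE and profile[PLUS3 - 1] and profile[PLUS3 + 1]:
        profile[PLUS3] = 1
    return profile


def longest_stretch(site):
    """Largest bond total within a run of neighboring base pairs."""
    best = current = 0
    for bonds in refined_profile(site):
        current = current + bonds if bonds else 0
        best = max(best, current)
    return best


def is_hc_continuous(site, threshold=13):
    """Apply the NH >= 13 threshold for a continuous stretch."""
    return longest_stretch(site) >= threshold


for name, seq in {"cs -314": "CAGGTGTCATG", "3G7C": "GCAGTGAGTCG"}.items():
    print(name, longest_stretch(seq), is_hc_continuous(seq))
```

With these rules cs –314 yields 12 and 3G7C yields 13 hydrogen bonds, matching the numbers given in the paragraph, so only 3G7C meets the threshold of NH ≥ 13.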

We also examined whether the modification proposed for the calculation of Watson–Crick hydrogen bonds at the 5′ terminus of the 11 nt long sequence would also hold for its 3′ end, i.e. position +8. To examine this question, we compared a further set of SD4 splice sites mutated at the 3′ end (Fig. 8). We purposely chose mutants with suboptimal env expression to avoid saturation effects. Without exception, the classification as HC for the hydrogen bond patterns of these mutants was consistent with env expression.

Classification of hydrogen bonding patterns in terms of HC or LC was derived from experiments with a single intron construct and an env-mRNA degradation assay. In order to examine the contribution of U1 snRNA binding to exon recognition and splicing, we supplemented the transcription unit with HIV-1 exon 1 and the 5′ portion of intron 1, thus obtaining a 2-intron HIV-1 sub-genomic env expression construct (as outlined in Fig. 9A).

Moreover, we substituted SD4 with four different 5′ splice sites. Two of them [SD2 (80.08), –1G3U (79.34)] revealed high Shapiro–Senapathy scores (11) of ∼80, and the other two [4G7C (65.66), 4C8U (64.84)] had low scores of ∼65. Each pair contained one site graded as HC (–1G3U, 4C8U) according to our hydrogen bond model. These sites were verified as functional 5′ splice sites by detection of an exclusively double-spliced RNA/cDNA pattern (Fig. 9B). Correspondingly, the western blot confirmed env expression (Fig. 9D). The remaining two 5′ splice sites were both graded as LC, showed a single-spliced RNA/cDNA pattern indicative of exon skipping, and the pertinent immunoblots failed to detect Env glycoprotein (Fig. 9B and D). Notably, in this example the high Shapiro–Senapathy score 5′ splice site SD2 (authentic HIV-1 5′ splice site), graded as LC by our hydrogen bond model, led to exon skipping, whereas the low Shapiro–Senapathy score 5′ splice site 4C8U, graded as HC by our hydrogen bond model, showed no exon skipping (Fig. 9B, see lanes 1 and 4).

In contrast to the single-intron env expression construct, where lack of U1 snRNA binding led to complete RNA degradation, in the two-intron construct the RNA degradation was incomplete, possibly due to additional U1 snRNA binding at SD1. If both 5′ splice sites were used, the amount of RNA exceeded the amount found in the case of exon skipping, indicating a possibly additive effect of two U1 snRNA binding sites protecting RNA against degradation.

A concern was the widespread applicability of our hydrogen bond model since all experimental work was done on HIV-1 SD recognition. To test whether the classification in terms of HC and LC obtained from our hydrogen bond model could be extended to predict exon skipping occurring in human genetic diseases, we chose a prototype human gene known to harbor a high number of exons and splice mutations. The ATM gene (ataxia telangiectasia mutated, MIM 208900, Louis Bar syndrome) includes 66 exons and more than 300 different mutations are documented (ATM Mutation Database, http://www.benaroyaresearch.org/bri_investigators/atm.htm). From all available, published and unpublished mutations and polymorphisms, we selected those involving the nucleotides –3 to +8 of 5′ splice sites, except for the canonical GT dinucleotide in positions +1 and +2. It is well known that point mutations at the 5′ splice site do not necessarily cause alterations of the transcript. Therefore, we further selected only those changes at positions –3 to –1 and +3 to +8 within ATM SD sites where RNA/cDNA studies had been done and pertinent information was available. Given these limitations, a total of 21 sequence changes within 19 different ATM 5′ splice sites were collected (Table 1). Seventeen among these 21 changes in ATM 5′ splice sites led to exon skipping and thus represented mutations. The hydrogen bond patterns of the mutated SDs consistently revealed a lower complementarity than the corresponding patterns of their physiological counterparts (Table 1). Similarly, the Shapiro–Senapathy 5′ splice site score (11) was found to be decreased in all these mutations compared with that of the corresponding wild-type 5′ splice site. Three of the remaining four sequence changes that did not lead to an alteration of the transcript and thus qualified as polymorphisms, had an identical hydrogen bond pattern and also a comparable Shapiro–Senapathy score as their corresponding wild type. 
Calculations showed that the correlation between exon skipping and the change in classification from HC to LC was significant even in our relatively small data set of 21 sequence alterations in 5′ splice sites of the human ATM gene (Spearman ρ = 0.54, P = 0.018, χ²-test). Thus, our hydrogen bond model proved to be applicable in the characterization of human SD changes.

Interestingly, 14 of the 19 physiological ATM 5′ splice sites considered were classified as HC at the threshold levels defined above. In contrast, four of the remaining five physiological 5′ splice sites, which were classified as LC, showed a continuous stretch of 12 hydrogen bonds, which is just below the threshold defined from HIV-1. In these cases, splicing of the human ATM transcript may be mediated by an ESE.

To examine a possible dependence between the presence of ESEs and sequence complementarity of SD4 to U1 snRNA, we analyzed RNA duplex formation of HIV-1 transcripts with and without the GAR motif upstream of SD4, which has been shown to influence U1 snRNA binding (3). As shown here and previously, substituting four single Gs of the GAR motif with Cs (GAR) led to a reduction of env expression, which was estimated to be 4-fold (Fig. 10A, lanes 2 and 5, and C, lanes 1 and 2). Interestingly, cotransfection of a site-mutated U1 snRNA complementary to nucleotides –3 to +7 of SD4 (U19U10G11C) partially restored env expression in the absence of the upstream GAR repeats (Fig. 10A, lanes 5 and 6). To confirm that an increased number of complementary base pairs could compensate for the lack of the GAR repeats, we analyzed env expression of the construct SV E/X tat rev GAR cs –120 that had two more complementary bases than splice site SD4, but inactivated GAR repeats. As shown in Figure 10C, env expression was fully restored, indicating that increasing the complementarity of the RNA duplex sufficiently facilitated U1 snRNA binding, dispensing with the necessity of GAR support.

Based on our mutational analysis of the HIV-1 5′ splice site SD4, initiated in Kammler et al. (3) and extended in this paper, we described U1 snRNA binding sites with a hydrogen bond model. An algorithm, available at http://www.splicefinder.net/, systematically classified U1 snRNA binding sites as HC or LC according to their capability to functionally replace SD4 in the 5′ splice site-dependent Rev-mediated env expression.

The 5′ splice site consensus sequence only includes nucleotides in positions –2 to +6 (13), taking into account the concept proposed by Shapiro and Senapathy to calculate 8mer 5′ splice site scores (11). However, Hibbert and coworkers (48) and we were able to identify 5′ splice sites that were functionally dependent on complementary nucleotides in positions +7 or +8 or both (3) (Fig. 4). Moreover, a preference for a purine base at position +7 was also recently found using a strategy to select functional 5′ splice site sequences in vitro from a pool of pre-mRNAs containing randomized 5′ splice sites after three rounds of selection (4). Consistent with these findings, here we demonstrated that in an in vivo RNA splicing assay, complementary nucleotides in positions +7 or +8 contributed to exon recognition. Therefore, we included these two positions in the calculations of our algorithm, although statistically the contribution of nucleotides in these positions to splicing is a rare phenomenon, as is apparent from nucleotide frequency distributions.

Recently, it has been shown by NMR spectroscopy that only one hydrogen bond is formed by G:U wobbles, depending on their adjacent base pairs (37). Therefore, we analyzed the contribution of G:U wobbles to the stability of the RNA duplex in all possible positions of the 5′ splice site. In all cases analyzed, a G:U consistently contributed less to the U1 snRNA-dependent Rev-mediated env expression than an A:U base pair in the same position. Since an A:U pair contributed a maximum of two hydrogen bonds, we permitted a maximum of one hydrogen bond for a G:U wobble. However, if a G:U was not stabilized by a sufficient number of adjacent hydrogen bonds, it had to be treated as a mismatch (Fig. 7, see mutants cs –314 versus 3G7C). In this experiment, a point mutation A3G completely suppressed env expression (Fig. 7, see mutants cs –315 versus cs –314), which is unexpected from the point of view of the frequency-based Shapiro–Senapathy concept assigning nearly equal weight to A and G in position +3. This is in accordance with X-linked retinitis pigmentosa, where the IVS4+3a>g point mutation in the retinitis pigmentosa GTPase regulator (RPGR) gene leads to skipping of exon 4 and disease (49), and our algorithm correctly predicts exon skipping (CAGGTAtagtg, graded as HC with 14 hydrogen bonds, versus CAGGTgtagtg, graded as LC with 12 hydrogen bonds). Our experimental results also parallel statistical findings by Burge and Karlin, who observed a dramatic preference for A over G at position +3 in introns carrying a mismatch in position +5 (21).

The amount of env expression as a measure for U1 snRNA binding is described correctly by our hydrogen bond model, when we assume that the terminal base pairs of the 5′ splice site, i.e. –3 and +8, contribute one hydrogen bond less to the stability of the duplex than is given by Watson–Crick pairing. The stability of a base pair in position –3, for instance, might be reduced due to structural hindrance caused by a decreased flexibility of the U1 snRNA in this position next to stem I (50), which is hidden in a doughnut-shaped central element (51).
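
The counting rules discussed above can be illustrated with a minimal Python sketch. It is not the published algorithm (available at http://www.splicefinder.net/), which uses further parameters; it only counts hydrogen bonds in the longest continuous stretch of paired positions. The following points are assumptions made for illustration: the first 11 nucleotides of U1 snRNA are taken as 5′-AUACUUACCUG-3′ with modified bases treated as plain nucleotides, a G:U wobble is counted as one hydrogen bond only when both neighboring positions form Watson–Crick pairs and is otherwise treated as a mismatch, and a single threshold of 13 hydrogen bonds separates HC from LC. The function names are hypothetical.

```python
# U1 snRNA nucleotides 11..1, aligned with 5' splice site positions -3..+8.
# Assumption: 5' end of U1 snRNA is AUACUUACCUG (modified bases treated as plain).
U1_PARTNERS = "GUCCAUUCAUA"

WATSON_CRICK = {("G", "C"): 3, ("C", "G"): 3, ("A", "U"): 2, ("U", "A"): 2}
WOBBLE = {("G", "U"), ("U", "G")}


def position_hbonds(site):
    """Return the hydrogen bonds assigned to each position -3..+8."""
    rna = site.upper().replace("T", "U")
    if len(rna) != len(U1_PARTNERS):
        raise ValueError("expected 11 nucleotides covering positions -3 to +8")
    pairs = list(zip(rna, U1_PARTNERS))
    bonds = [WATSON_CRICK.get(pair, 0) for pair in pairs]
    last = len(pairs) - 1
    for i, pair in enumerate(pairs):
        if pair in WOBBLE:
            # Assumed stabilization rule: both neighbors must be Watson-Crick pairs,
            # otherwise the wobble is treated as a mismatch.
            left_ok = i > 0 and pairs[i - 1] in WATSON_CRICK
            right_ok = i < last and pairs[i + 1] in WATSON_CRICK
            bonds[i] = 1 if (left_ok and right_ok) else 0
    # Terminal base pairs (-3 and +8) contribute one hydrogen bond less.
    for i in (0, last):
        bonds[i] = max(bonds[i] - 1, 0)
    return bonds


def continuous_hbonds(site):
    """Sum of hydrogen bonds in the longest continuous stretch of paired positions."""
    best = run = 0
    for b in position_hbonds(site):
        run = run + b if b else 0
        best = max(best, run)
    return best


def classify(site, threshold=13):
    """Simplified HC/LC call using only the continuous-stretch threshold."""
    return "HC" if continuous_hbonds(site) >= threshold else "LC"


print(continuous_hbonds("CAGGTAtagtg"), classify("CAGGTAtagtg"))  # 14 HC
print(continuous_hbonds("CAGGTgtagtg"), classify("CAGGTgtagtg"))  # 12 LC
```

Under these assumptions the sketch reproduces the 14 and 12 hydrogen bonds quoted for the wild-type and IVS4+3a>g RPGR splice sites; other sites may be graded differently by the full algorithm.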

The accessibility of the U1 snRNA binding site might also be influenced by local intra-strand RNA binding at or around the splice site as has been shown for the HIV-1 splice site A3 (52). To investigate whether secondary structures could have influenced U1 snRNA binding at SD4, we substituted SD4 with the stem–loop structure carrying SD1 (53) in another set of mutant vectors. However, even an extended stem did not lead to a decrease in the 5′ splice site-dependent Rev-mediated env expression (data not shown). Therefore, we consider it unlikely that the mutational analysis presented in this paper was additionally influenced by RNA secondary structure.

Our examination of 5′ splice sites within the human ATM gene serves only as an example to show that our algorithm is applicable in principle to data other than the HIV-1 genome. This gene includes a total of 64 canonical SDs. Examination of these 64 sequences using our algorithm yielded 51 correct classifications as HC (sensitivity 80%) when we set a threshold of Scon = 13 hydrogen bonds as derived from our studies with HIV-1 SD4. Lowering the threshold in one of the parameters by one hydrogen bond to Scon = 12 resulted in 60 correct SD classifications (sensitivity 94%). In a data set of 17 SD mutations comprising the positions –3 to –1 and +3 to +8 of the human ATM gene leading to exon skipping, the same modification of the threshold decreased the sensitivity to classify a mutation as LC from 16/17 (94%) to 14/17 (82%). This is in accordance with the concept that ESE or ISE sequences can compensate for ‘weak’ (non-consensus) 5′ splice sites, and that ‘strong’ splice sites may eliminate enhancer dependency (54). Experimentally, this relationship between the strength of a 5′ splice site (HC/LC) and a splicing enhancer is confirmed by our findings in HIV-1. The dependency on the upstream sequence for the recruitment of U1 snRNP to the 5′ splice site SD4 was abolished simply by increasing the complementarity between SD4 and the free 5′ end of U1 snRNA. The upstream fragment supported ASF/SF2 binding in an in vitro gel mobility shift assay (data not shown). Thus, a more refined classification of a hydrogen bonding pattern as HC or LC will have to take flanking enhancer sequences into account. Leaving the basic counting rules in our hydrogen bond model unchanged, the effects of, for example, different SR protein binding sites might be included just by suitably redefining the HC conditions in terms of threshold parameters.
Up to now, 11 human SR proteins have been described that may support U1 snRNP binding to 5′ splice sites in a hierarchical way at different levels of complementarity. HC sites would require comparatively less stabilization, so that most SR proteins could support U1 snRNP binding, which might explain their functional redundancy (reviewed in 33). Non-redundant functions might be attributed to the capability of an SR protein to stabilize U1 snRNP binding even in the absence of complementarity. The SR proteins’ capacity for RNA duplex stabilization might depend on different interactions of their respective RS domains with U1 70K. Although it has been shown that the RS domains of ASF/SF2 and SC35 are functionally interchangeable (29), biochemical evidence has been provided that U1 70K interacts differently with some SR proteins (25).
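
The sensitivities quoted above for the ATM data at the two thresholds follow directly from the reported counts. The short Python sketch below recomputes them to make the trade-off explicit; the counts are those given in the text, while the helper name and the table layout are hypothetical.

```python
def sensitivity_percent(correct, total):
    """Percentage of sites classified as expected, rounded to an integer."""
    return round(100 * correct / total)


# (description, threshold S_con, correctly classified, total), as reported in the text
REPORTED = [
    ("canonical ATM SDs classified HC", 13, 51, 64),
    ("canonical ATM SDs classified HC", 12, 60, 64),
    ("exon-skipping mutations classified LC", 13, 16, 17),
    ("exon-skipping mutations classified LC", 12, 14, 17),
]

for label, s_con, correct, total in REPORTED:
    pct = sensitivity_percent(correct, total)
    print(f"S_con = {s_con}: {label}: {correct}/{total} = {pct}%")
```

Lowering the threshold by one hydrogen bond raises the share of canonical SDs recognized as HC from 80% to 94% but lowers the share of exon-skipping mutations recognized as LC from 94% to 82%.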

Given the availability of the human genome sequence and knowledge of several thousand alternative and constitutive real 5′ splice sites, our algorithm could generate large data sets of real 5′ splice sites that have a comparable capability to support U1 snRNA binding according to their hydrogen bonding patterns. Using such data sets, it should be possible to identify common motifs within the flanking sequences of 5′ splice sites that might act as enhancer or silencer sequences.

Any genuine prediction of functional 5′ splice sites will have to take splice enhancers/silencers into account. Thus, at this stage, the power of our algorithm lies in its value to predict the presence of enhancer/silencer sequences.