dict.md logo

RECOVIR: An application package to automatically identify some single stranded RNA viruses using capsid protein residues that uniquely distinguish among these viruses

Most single stranded RNA (ssRNA) viruses mutate rapidly to generate large number of strains having highly divergent capsid sequences. Accurate strain recognition in uncharacterized target capsid sequences is essential for epidemiology, diagnostics, and vaccine development. Strain recognition based on similarity scores between target sequences and sequences of homology matched reference strains is often time consuming and ambiguous. This is especially true if only partial target sequences are available or if different ssRNA virus families are jointly analyzed. In such cases, knowledge of residues that uniquely distinguish among known reference strains is critical for rapid and unambiguous strain identification. Conventional sequence comparisons are unable to identify such capsid residues due to high sequence divergence among the ssRNA virus reference strains. Consequently, automated general methods to reliably identify strains using strain distinguishing residues are not currently available.

We present here RECOVIR ("recognize viruses"), a software tool to automatically detect strains of caliciviruses and picornaviruses by comparing their capsid residues with built-in databases of residues that uniquely distinguish among known reference strains of these viruses. The databases were created by constructing partitioned phylogenetic trees of complete capsid sequences of these viruses. Strains were correctly identified for more than 300 complete and partial target sequences by comparing the database residues with the aligned residues of these sequences. It required about 5 seconds of real time to process each sequence. A Java-based user interface coupled with Perl-coded computational modules ensures high portability of the software. RECOVIR currently runs on Windows XP and Linux platforms. The software generalizes a manual method briefly outlined earlier for human caliciviruses.

This study shows implementation of an automated method to identify virus strains using databases of capsid residues. The method is implemented to detect strains of caliciviruses and picornaviruses, two of the most highly divergent ssRNA virus families, and therefore, especially difficult to identify using a uniform method. It is feasible to incorporate the approach into classification schemes of caliciviruses and picornaviruses and to extend the approach to recognize and classify other ssRNA virus families.

Most non-bacterial epidemic outbreaks are caused by single stranded RNA (ssRNA) viruses. Typically, these viruses undergo rapid genetic mutations that result in a large and dynamic population diversity, which are seen as different virus strains utilizing multiple hosts [1]. Relationships among the strains are usually inferred through conventional homology based comparisons using complete capsid sequences or other genomic regions. Such comparisons seek to identify clusters of similar sequences that comprise major groups (genogroups or genera) and sub-groups (species and serotypes). These groupings are targeted by various diagnostics [2-13] to recognize and classify the viruses.

Four calicivirus genera (noroviruses, sapoviruses, lagoviruses and vesiviruses) and nine picornavirus genera (apthoviruses, cardioviruses, enteroviruses, erboviruses, hepatoviruses, kobuviruses, parechoviruses, rhinoviruses and teschoviruses) are known [14-18]. Further divisions of these genera reflect more detailed sequence relatedness among these viruses. For example, among caliciviruses, noroviruses are divided into two genogroups GI and GII each of which contains seven sequence clusters (GI.1–GI.7 and GII.1–GII.7) [18-21], sapovirus sequences are grouped into 2–5 genogroups, of which, each contains several clusters [22-24], vesivirus sequences are known to contain at least 40 immune response related antigenic serotypes and lagovirus sequences cluster into proposed sero-specific groups [25]. Similarly, the 9 picornavirus genera are classified into several species, each of which consists of a large number of serotypes (Table 1).

The total number of genera, species, serotypes and the associated strains may be taken as a measure of the diversity of a given virus family. Based on this measure, the picornaviruses and the caliciviruses are two of the most highly divergent ssRNA virus families [14,15,17]. Within the genomes of these two virus families, the coat protein sequences show the lowest overall identity, possibly due to the large number of mutations that the viruses undergo to evade host immune responses.

The coat protein consists of a single subunit for the caliciviruses and four subunits (VP1–VP4) for the picornaviruses [15,26]. Structures of picornavirus coat protein [27-31] show that the exposed parts of VP1 subunits contain most of the important neutralization sites [29,32,33] thus making VP1 the immunodominant regions. Additionally, in comparison with the other genomic regions, VP1 sequences show the strongest statistical and phylogenetic support in establishing consistent genetic and antigenic relationships among picornaviruses [15,34]. Therefore, VP1 sequences are most frequently used in molecular evolution and strain characterization studies in picornaviruses.

Although caliciviruses appear to be relatively simpler because of the presence of only one coat protein sequence whose structure is known for a human and an animal strain [35,36], experimental difficulties arise in identifying strains of human caliciviruses as these viruses cannot be antigenically characterized due to their non-cultivability. Consequently, computational techniques offer the most feasible ways to recognize and predict strains of these viruses. Among the different parts of the genome, the coat protein sequences offer the best choice in computation-based strain prediction methods in caliciviruses, mainly because, of their highest overall variability and antigenic correlations as compared with sequences of the other genomic regions. This is true even for the human caliciviruses for which antigenic relationships are often deduced on the basis of antigen and antibody ELISAs using expressed capsid proteins as has been done for noroviruses [21,37-39]. We examine here a computational approach to predict strains in both picornaviruses and caliciviruses on the basis of capsid sequences.

Strain predictions critically depend on sequence information. Most existing prediction methods for caliciviruses and picornaviruses use sequence similarity cut-off values derived from homology-based sequence comparisons between target sequences and known reference sequences [20,34]. Although recent reports indicate reliable estimation of such cut-off values in distinguishing three peak regions that correspond to the genogroups, clusters and strains in noroviruses [40], no uniform criteria exist to accurately estimate these values for the other caliciviruses. Additionally, these cut-off values may be impossible to estimate when different virus genera or families are analyzed together. These difficulties are further compounded when only partial sequences from smaller and relatively more conserved regions are available [12,20,41-44]. Even for complete capsid sequences, homology based similarity scores pose limitations in determining strains of viruses due to the exponential dependence of computation time on the sequence lengths and the number of sequences thereby negatively impacting prediction accuracy.

One approach to address the computational bottleneck is to efficiently align sliding windows of target virus sequences against databases of reference sequences and use the highest overall alignment scores to genotype the target sequences [45]. However, such methods critically depend on parameters such as window sizes and reference sequences. Incorrect choice of these parameters may introduce error-inducing biases while significantly increasing the computation time due to repetitive runs using different trial values of these parameters [45].

Thus, strain recognition methods using sequence identity scores have not been easily amenable to reliable and robust automation across ssRNA virus families. Based on an earlier analysis of noroviruses [46], we describe here an implementation of a residue-wise comparison approach to automate strain predictions using both complete and partial amino acid capsid sequences of caliciviruses and picornaviruses.

The method is demonstrated using calicivirus and picornavirus capsid sequences. Defining some of these sequences as references, phylogeny based databases of capsid residues that uniquely distinguish among the reference sequences were created. Residue-wise comparisons of the input target sequences with the databases identify those phylogenetic branches whose reference sequences most closely resemble the target sequences. These branches, in turn, yield the genogroup and other classification characteristics of the target sequences thereby identifying their strains.

Partitioned phylogenetic trees were constructed for each of the 4 calicivirus genera and the 9 picornavirus genera following the procedure described earlier [46] based on evolutionary trace approaches [47]. The virus sequences, obtained from public databases [26,48], were aligned using ClustalW [49] in which the Gonnet-250 distance matrix model [50] was used with penalties for gap opening, closing, extension and separation set to 10, -1, 0.2 and 4 respectively.

The Fitch-Margoliash distance matrix was calculated for each set of aligned sequences and phylogenetic trees were constructed for each such distance matrix using the neighbor-joining Kitsch algorithm of PHYLIP package [51] as implemented on the Cambridge server [52]. The mutation rate was assumed constant throughout the tree. Each tree was displayed with the branches horizontal, the 'root' on the left and the tips on the right.

The topology of these trees, defined in terms of the connectivity and the relative branch lengths, was compared with those of other trees generated using other neighbor joining methods including the unweighted pair group method with arithmetic mean (UPGMA) distance criterion and the minimum evolution criterion using Poisson distance models of amino acid residues as implemented in the MEGA4 package [53,54]. Similar comparisons were also done between the capsid sequence trees and the trees constructed using non-capsid sequences such as those of the RNA-dependent RNA polymerase regions or of the non-coding regions. Because the topology of capsid trees alone remained almost insensitive to the method used, it was deduced that the capsid sequences were phylogenetically most robust. Therefore, any of the neighbor joining methods could be used to draw the trees for capsid sequences, and the Fitch-Margoliash based trees were chosen for further analysis. In contrast, the topologies of the non-capsid trees were quite sensitive to the tree construction method used and hence, these trees were not considered.

The trees were divided into ten equally spaced partitions P01P10. The inter-partition spacing was calculated from the maximum evolutionary distances within the trees. Ten partitions were optimum for all trees as no significant changes in node distributions were observed in any tree with a further increase in partition numbers. The sequences used to construct the trees are called "reference" sequences and the corresponding trees are called "reference" or "genus" trees because each reference tree represents a calici or a picornavirus genus.

Each partition acts as a "similarity filter" by creating different sequence groups each of which contains similar sequences emanating from a given node within the partition. Examples of such sequence groups are illustrated for a hypothetical representative tree (Fig. 1a). Starting from the root node, partition 1 contains all of the aligned sequences. Sequence comparisons in this partition, therefore, are equivalent to conventional sequence comparisons that consider all of the sequences together. Partitions 2 and 3 are identical and contain the sequence groups s1–s11 and s12 belonging to nodes A and B respectively. Similarly, the two nodes C and D belong to partition 4 while nodes E and F belong to partition 5. Node C of partition 4 contains sequence groups s1 and s2 while node D of the same partition contains nine (s3–s11) of the remaining groups. Similarly, the eight groups (s3–s10) in partition 5 belong to node E while group s11 belongs to node F in the same partition (Fig. 1a). Henceforth, sequence clusters or sequence groups will be referred to simply as groups.

For each tree, sequences belonging to different groups of a given partition were separately aligned and the resulting aligned classes were compared to obtain the consensus residues for that partition. These consensus residues were identified as "characteristic residues" that are conserved within each group but not among the different groups of the partition. For example, characteristic residue X1 at location 1 of partition 2 of a given tree may be a conserved Ala for Group A of node A in contrast to a conserved Gly (X1') for Group B of node B (Fig. 1a). Such residues were partition-wise generated for each tree.

Entire information about the genus trees including their partitions, all sequence groups for each partition and all of the characteristic residues taken group-wise, were stored in multiple 2-dimensional arrays that formed the calici and the picornavirus databases.

To identify the strain of the input query ("target") sequence whose genus is known, the program matches the target residues with the characteristic residues of each group of a given partition stored in the appropriate genus database. This is done by first aligning the target sequence with a database reference sequence. Next, starting with the second partition from the root (partition 2 in Fig. 1a), each characteristic residue of a given group in this partition is compared with the target residue at the corresponding location. Such comparisons are carried out for all partition 2 groups. The input target sequence is assigned to the partition 2 group with the maximum number of matches.

The program proceeds to the next partition, where, instead of testing all partition groups, the program tests only those that are directly tree-linked with the most recently accepted group. This greatly reduces the number of groups to be searched. The process continues until all partitions have been searched. Testing only a limited number of connected groups per partition guarantees an optimal tree search time, thereby making the program computationally efficient.

Within a given partition, once input target sequence residues are matched, they are flagged as "marked" and not considered in subsequent partitions. Exceptions to the flagging procedure are carried out only in case of ambiguities. For example, if all of the groups in a partition show an identical number of characteristic residue matches, an ambiguity is declared and no match is flagged. This ensures that all such matched residues of the input sequence are available again for matching purposes in subsequent partitions thereby helping resolve the ambiguity. Ambiguities may also occur when all groups within a given partition show no matches with the input sequence, or, if two successive partitions show identical numbers of characteristic residue matches. In both these cases, the program ignores the ambiguous partition(s) and proceeds to the next one without marking any residue. This allows all of the unmarked residues in the current partition to be compared in subsequent partitions.

To illustrate the method, consider an input sequence corresponding to a known genus whose characteristic residues X1, X2 and X3 for partition 2 are Ala, Pro and Ser respectively for group A and the corresponding group B residues (X1', X2' and X3') are Gly, Thr and Met respectively (Fig. 1a). If the aligned target sequence shows more matches for X1, X2 and X3 of group A in comparison with that of X1', X2' and X3' of group B, it implies that the input sequence belongs to group A in partition 2 and not to group B. Thus, residues X1, X2 and X3 are flagged and comparisons in subsequent partitions follow along those branches that are connected to group A (Fig. 1a). The program therefore proceeds to groups C and D in partition 4 ignoring partition 3 as it is identical to the previous partition 2 (Fig. 1a).

In partition 4, if both groups C and D show equal number of matches i.e. say, X4 and X4' are both Trp and X5 and X5' are both Leu (Fig. 1a) and the aligned target sequence also contains Trp and Leu at these locations, then the program will not flag these characteristic residues but will instead carry them over to partition 5 where these residues will again be matched in groups E and F (Fig. 1a) to determine which of these two groups maximally matches the target sequence. Similar comparisons in subsequent partitions 6–10 unambiguously identify the database strain that most closely resembles the input target sequence thereby yielding its strain characteristics.

When the genus of the target sequence is not known, the sequence is first compared with groups of representative reference sequences (≤ 3 sequences per group) from each of the genus trees in the database using ClustalW [49]. Alignment scores are computed for each group. The highest alignment score is used to select the most appropriate genus tree from the databases. Detailed strain identification is then conducted as described earlier. The use of a small number of representative sequences per group ensures rapid genus determination regardless of the number of reference sequences present in the genus tree.

Partition-wise comparisons allow RECOVIR to detect abrupt changes in phylogenetic sequence groupings among trees constructed using sequences from different genomic parts. Assuming that absence of recombination creates similar phylogenetic relationships among sequences from different regions of multiple sequence alignments, these abrupt changes or incongruities indicate nodes that may possibly contain recombination sites. For example, an abrupt change is schematically shown using a pair of hypothetical phylogenetic trees constructed using sequences from two different genomic regions of a given set of virus strains (Figs. 2a &2b). A simple incongruity has artificially been built in by interchanging sequences s1 and s10 between the two trees. Consequently, sequences belonging to nodes A and B that are distinguished by residues (X1, X2, X3) and (X11', X21', X34') in the two trees will not define the recombination sites. However, subsequent nodes C, D, E and F which reflect the incongruity between the two trees up to partition 5, will likely define the recombination sites (Figs. 2a &2b). For example, if nodes C and D are distinguished by residues (X4, X5, X6) and (X4', X5', X6') in one part of the genome (Fig. 2a) and by (Y42, Y53, Y67) and (Y42', Y53', Y67') in the other part (Fig. 2b), recombination events must involve some of these residues to explain the incongruity between the trees. Similarly, nodes E and F that are distinguished by residues (X7, X8, X9) and (X7', X8', X9') in one part of the genome (Fig. 2a) and by (Y71, Y83, Y94) and (Y71', Y83', Y94') in the other part must include the recombination sites (Fig. 2b). Similar systematic comparisons of the strain distinguishing residues in nodes G through T of the remaining partitions show that recombination events can not involve sequences s3–s8 and s11–s12 as the topology of the nodes involving these residues are identical in both trees (Figs. 2a &2b). Instead, interchange of region 1 between sequences s2 and s9 may be a possible recombination mechanism that leads to the incongruence between the two trees (Fig. 2c).

In contrast, if there are changes in some of the strain distinguishing residues of a given region without changes in the tree topology, it indicates possible spontaneous mutations. Detection of possible recombination or spontaneous mutations is done manually at present as an automated version of this feature has not yet been built into the software.

RECOVIR was initially validated by identifying the closest strains for five noroviruses and five enteroviruses of known genera from their respective complete and partial amino acid sequences of the capsids (Table 2). These viruses were chosen because of their sequence diversity and the availability of sequences of a large number of strains or serotypes in public databases. The enteroviruses included the complete VP1 sequences of Poliovirus-2 (serotype), human enterovirus (HEV)-B including the Coxsackie viruses, serotypes of the HEV-D species and the simian enteroviruses.

RECOVIR was further validated using more than 200 complete and partial sequences of different caliciviruses and more than 100 such picornavirus sequences. Among the calicivirus sequences, nearly 120 were of norovirus strains with the remaining ones being of other caliciviruses. Of the picornavirus sequences, nearly 50 were enterovirus partial sequences including those of echoviruses and other HEV-B serotypes [26], with the remainder chosen from other picornavirus genera and species (Table 1). Most of the partial sequences were randomly chosen from different regions of the capsid with some of the partial sequence lengths being only ~20% of the complete capsid sequence lengths. All norovirus and enterovirus sequences were selected from the NCBI databases [48]. None of the selected sequences were among those from the program databases to minimize possible errors due to biased sequence choices.

All functional modules of RECOVIR were written in Perl programming language on Windows XP and the Linux platforms. A Java-based graphical user interface (GUI) has been designed to access all functionalities through user-friendly I/O options (Fig. 1b).

Input selections on the GUI have been divided into three categories: sequence, databases and options to control output (Fig. 1b). Any number of query sequences may be input by either pasting them on the white board area or by browsing one or more directories for single or multiple sequence files. There is no limit to the number of input sequence files. The present version reads sequences only in FASTA format ignoring all white spaces and non-alphabet characters.

A dropdown menu in the GUI database section allows specifying the input sequence genus, if known (Fig 1b). A default reference sequence, used for aligning the input sequences and for assigning the aligned location numbers to the input sequence, then pops up in the "Select a reference" box (Fig. 1b). The default reference sequence may be changed if required. In case the genus of the input sequence is not known, the "unknown" option in this box (Fig. 1b) allows the program to automatically determine the genus and reference sequence from the built-in databases.

The "Run" button activates the program, displays a progress bar of the percentage of input sequences processed and displays results in the output section of the GUI (Fig. 1b). Depending on the "View results" options chosen, the output includes only a summary or the complete details of partition-wise matches between the database residues and the residues of the input sequences. Different GUI options allow many other details of such matches to be viewed (Fig. 1b).

Detailed strain identification results are described here for selected norovirus strains among the caliciviruses and some enterovirus strains among the picornaviruses. Complete details for the other caliciviruses and the picornaviruses can be found with the authors.

Sequences belonging to the norovirus genus tree have been described elsewhere [46]. Briefly, these sequences form a single group in partition P1 (Fig. 3a). Partition P2 splits this sequence group into the 2 known major genogroups GI and GII of noroviruses (Additional file 1) [18,46]. Group GI further divides into GIa and GIb in partition P3 while group GII divides into GIIa and GIIb in partition P4 (or P5 that is identical with P4) (Fig. 3a & Additional file 1). Partition 6 further splits both GI and GII groups. Group GIa is divided into GIa1 and GIa2 while group GIIa splits into 4 groups (GIIa1–a4). In partition 7, Group GIa splits into 4 groups GIa1–a4 while GIIa3 splits further into the GIIa3_1 and GIIa3_2 groups (Fig. 3a & Additional file 1). Groups GIb, GIIa4 and GIIb do not split up any further in this partition. Details of the remaining partitions P8–P10 are available with the authors.

The characteristic residues of the norovirus genus tree stored in the program databases unambiguously identified the closest strains of target norovirus complete capsid sequences (Table 2). For example, comparisons of the characteristic residue locations with the corresponding residues of the norovirus "Seacroft" sequence (NCBI: AJ277620) showed that the maximum number of matches in partition P2 occurred in the aligned locations 70V and 203T of group GII and not for GI (Fig. 3a). The corresponding GI locations were 70I and 201V respectively when mapped to the unaligned reference Norwalk virus (PDB ID: 1IHM) sequence (Fig. 3b). Thus, the program searched only along the GII branch of the norovirus genus tree in subsequent partitions. In partition P3, the query sequence maximally matched the GII characteristic residues 4A, 5S, 36V, 44P, 104A, 106G, 205S, 446I, 455P, 508R, 511N and 519F while the corresponding GI residues in the reference Norwalk virus sequence are 4A, 5S, 36V, 44A, 104V, 106N, 203A, 405L, 414F, 460H, 463D and 471G respectively (Figs. 2a &2b). In partitions P4 and P5, the maximum number of matches (102G, 206C, 228S, 482D, 562A) occurred in GIIa, indicating that further database searches in subsequent partitions should be restricted only to those groups that originate from GIIa.

Of the 4 possible GIIa choices (GIIa1–a4) in the following partition P6, the maximum number of characteristic residue matches occurred in group GIIa3 (Fig. 3a) corresponding to residues 11S, 13D, 82D, 83L, 103W, 181R, 185M, 227K, 250I, 306G, 329H, 397S, 409V, 436L, 492P, 516V and 519A of the unaligned reference sequence (Fig. 3b). Comparisons in partition P7 showed that of the 2 groups GIIa3_1 and GIIa3_2 originating from the GIIa3 group of the previous partition P6, the GIIa3_2 group maximally matched the characteristic residues (Fig. 3a). These matches occurred at 63A, 84S, 110R, 119T, 125V, 126S, 133G, 139I, 145F, 154T, 163E, 184C, 192T, 193G, 207M, 218F, 229R, 236L, 257P, 259N, 347T, 402A, 419V, 420F, 447S, 448E and 504V of the unaligned reference sequence (Fig. 3b). Similar comparisons in partitions P8–P10 conclusively showed that the query sequence is most similar to the minireovirus-like norovirus strain (Fig. 4)

Similarly, the input Appalachicola Bay, the Baltimore 'a' and 'b' and the Boxer strain sequences most closely resembled the Chiba, minireovirus and the Potsdam strains of noroviruses (Table 2). Strains were also correctly predicted after removing random stretches of 10–15 amino acids from these sequences indicating thereby that the strain prediction capability of the program is quite robust and is independent of the locations of the input target sequences.

Strains of five partial amino acid capsid sequences of noroviruses were consistently predicted using both the "noro" and "unknown" database options of the GUI (Fig. 1b). Only one of the 5 input sequences (1UK1) explicitly included the N-terminus residues while the remaining sequences 2JP1, 3JP2, 4JP3 and 5TP1 were from different capsid regions (Table 3).

Despite the short sizes of the input partial sequences and the variations in their capsid locations, the program unambiguously recognized the strains of all of them from their matches with the characteristic location residues. Location 70 in partition P2 determined the major genogroup in all but one sequence. Sequences 1UK1 and 5TP1, both containing 70V, belonged to genogroup GII while 2JP1 and 3JP2 belonged to genogroup GI based on the corresponding 70I residue (Table 3; Figs. 2a &2b). All of the GI residues (2JP1 & 3JP2) contained 20Q in partition P2. However, the major genogroup of sequence 4JP3 appeared to be ambiguous as it showed features of both GI and GII groups in partition P2 (ambiguity shown as ** in Table 3). It contained not only 70V, which is typical of GII sequences, but also the GI-indicator 20Q. In addition, this sequence also presented the additional ambiguity of having an equal number of characteristic residue matches in both GI and GII groups for partition P2 (Table 3). To resolve this ambiguity, the program carried over the partition P2 matches of this sequence (4JP3) to partition P3.

In partition P3, the program examined characteristic residue locations 4, 5, 26, 36 and 44 for all of the sequences (Figs. 2a, 2b). Residue 44 unambiguously confirmed the distinction between the GI and the GII group sequences in this partition. Sequences 2JP1, 3JP2 and 4JP3, by virtue of 44A, were all characterized as GI sequences similar to the reference Norwalk virus sequence (Fig. 3b; Table 3). In contrast, sequences 1UK1 and 5TP1, containing 44P, are genogroup GII sequences. In addition, because 1UK1 has 4A and 5S and sequences 2JP1, 3JP2 and 4JP3 have 26N in partition P3, choices of their genogroups were unambiguously confirmed in this partition (Figs. 2a, 2b; Table 3). Residue 36V, being conserved in all of the sequences in P3, was not of much help in determining the sequence groups in this partition (Table 3).

The program ignored partitions P4 and P5 for the 1UK1 and 5TP1 sequences due to ambiguities in residue matches. The next partition P6 showed another ambiguity of having more than one group with the highest number of matches for both these sequences (Table 3). However, matches in partition P7 clearly indicated that sequence 1UK1 belongs to GIIa1 while 5TP1 belongs to group GIIa3_2 (Table 3; Figs. 2a &2b). Further matches in partitions P8–P10 (Table 3) confirmed that 1UK1 and 5TP1 were most similar to the Beeskow and the Bitburg strains respectively (Tables 2, 3; Additional file 1).

Thus, the identified strain of the partial sequence 1UK1 is consistent with its NCBI classification as a member of the GII.4 cluster. However, 5TP1 appears to belong to the GII.3 sequence cluster according to the present analysis and not to GII.4 as shown in the NCBI database. Similarly, partitions P6–P10 allowed the program to confirm that the remaining sequences 2JP1, 3JP2 and 4JP3 indeed belong to genogroup GI (Tables 2 &3) which is consistent with the NCBI classification of these sequences. In addition, the program determined from residue comparisons that these sequences 2JP1, 3JP2 and 4JP3 are most similar to the norovirus GI Chiba, Norwalk and the Potsdam strains respectively (Tables 2 &3). Such detailed strain information is seldom available for partial sequences in public domain databases.

The program first correctly identified the strains for 5 complete and 5 partial enterovirus capsid sequences. The complete capsid sequences were those of the VP1 subunits of poliovirus, simian enterovirus, echovirus and the Coxsackievirus strains (one sequence per strain) (Table 2). Different reference sequences were tested for use in alignment with the target sequences. The target sequence strains were all correctly identified regardless of the choice of the reference sequences (Tables 2 &3). As an example, strain identification for the poliovirus strain (NCBI: DQ841140) using the porcine enterovirus-8 sequence (PEV-8: NCBI accession number AF406813) as reference is briefly described. The second partition P2 of enteroviruses contains only 2 reference groups 1 and 2 in the program database (Table 4). Of these, group 2 contains only the porcine enteroviruses (PEV) serotype 9 (PEV-9; NCBI: AF363453) indicating that the PEV-9 strains are distinct from the other enterovirus strains all of which (including the PEV-8 strain) belong to group 1. This group is characterized by nine residues 40P, 42L, 44A, 46E, 48G, 72E, 124T, 125Y, 127R, 150Q, 155P, 157G, 169W, 176S, 191P, 199Y, 202F and 203Y (reference sequence PEV-8 numbering) in the database (Table 4). Because all of these residues match the corresponding locations in the target poliovirus sequence, the program assigns group 1 to the target sequence in partition P2 and skips the next partition P3 as it is identical to P2 (Table 4).

The program detects the separation of group 1 into two groups 1a and 1b in partition P4. This creates a total of 3 groups (1a, 1b and 2) in this partition (Table 4). While group 1a contains only the PEV-8 sequences which have independently diverged from group 1 of partition P2/P3, group 1b contains the remaining group 1 sequences of these partitions. Group 2 of partition P2/P3 remains unaltered in partition P4. On comparing the characteristic residues of these three groups, the program determines that the target sequence most closely resemble group 2 in partition P4 because of the matches with residues 61L, 62Q, 78L, 106S, 115S, 131D, 152L, 170N and 174S (PEV-8 numbering) that characterize this group (Table 4). Similarly, matches with residues 43T 55G 64T 84L 156H in partition P5 allow the program to assign group 6 to the target sequence in this partition which contains nearly 60 other similar enterovirus sequences including those of polioviruses (Table 4). Similarly, residue 158G distinguishes group 7 as the maximally matching group out of the eight groups in partition P6. This group includes many similar strains in the program database including 12 strains of different coxsackievirus serotypes, one echovirus strain and strains of all 3 poliovirus serotypes (Table 4).

The three poliovirus serotypes, which remain clustered with several other human enteroviruses in partition P7, finally separate as an independent group in partition P8 (group details not shown). However, partitions P7 and P8 showed an ambiguity with reference to the input target sequence. Both of these partitions have more than one group having an identical highest number of residue matches with the target. This ambiguity did not allow the program to decide the best group in these partitions. Consequently, all group information of these partitions was carried over to the subsequent partition P9 (Table 4).

Partition P9 contains a total of 72 sequence groups including an independent group (#62) containing only the poliovirus-2 strains. The program detected this group to be the maximally matching group for the target sequence and this detection was confirmed in the next partition P10. Thus, the input target sequence was correctly detected as a poliovirus-2 strain (Table 4). Strains for the remaining picornaviruses (Tables 2 &3) were also unambiguously detected regardless of the choice of the reference strains.

Partition-wise matches for the norovirus partial sequence 6ST1 indicated spontaneous mutations. While residue 20Q in partition P2 indicates that 6ST1 belongs to GI, residue 70V in the same partition indicates that this sequence may belong to GII as seen for sequence 5TP1 (Table 3). The program, therefore, checks both these possibilities for maximum number of matches in subsequent partitions. Each subsequent partition confirms that this sequence indeed belongs to GI and is closest to the Potsdam strain (NCBI accession AF439267). Therefore, location 70, which is an isoleucine for other GI sequences (Table 3), may be a spontaneous mutation site given the relatively large number of matches found which, in turn, are expected to maintain the topology of the tree. However, a complete sequence would be required to completely rule out the involvement of this site in recombination events.

Residue-wise comparisons may suggest that a few spontaneous mutations may be sufficient to force RECOVIR to detect a wrong genogroup of the target sequences. For example, it may appear that the single mutation 44A to 44P in sequence 6ST1 might switch it to a GII group (Table 3). However, the fact that the program checks the maximum number of matches of strain diversifying residues in each partition ensures that such single mutations within the capsid does not randomly change the strain type of the sequence. This is one of the major strengths of the present method.

For the software to branch off to a different strain on the phylogenetic tree, several mutations are usually required. For example, if locations 11, 30 and 40 of sequence 6ST1 are mutated without changing the rest of the sequence, it results in the synthetic GI sequence 6ST5 belonging to group GIa1 in partition P6 instead of GIa2 that corresponded to 6ST1 (Table 3; Additional file 1). Comparisons of strain diversifying residues in further partitions result in sequence 6ST5 belonging to a different GI cluster than that of 6ST1 (data not shown). Thus, the present method is able to identify spontaneous mutations that may lead to subtle changes in the phylogenetic groups of partial sequences.

Only a few enterovirus strains could be analyzed for recombination using RECOVIR due to the manual nature of such analysis in the present version of the software. Partition-wise comparisons of strain diversifying residues of VP1–VP3 genes of the hepatitis-A (HAV) strains at the nucleotide and the amino acid levels indicated that the SLF88 (AY032861) and the MBB (M20273) strains may be putative parental strains that yield the recombinant product strain 9F94 (AJ519487) (data not shown). This is consistent with recent results obtained using conventional topology comparisons of phylogenetic trees by sliding windows across the corresponding multiple aligned sequences [55].

Despite the limited results, it appears that the present approach may be more efficient in detecting recombination sites than current methods which are inherently time consuming as their accuracy depends on detailed topological comparisons obtained from multiple runs using different sequence windows. In addition, the entire procedure must be repeated every time a new sequence needs to be analyzed. In contrast, the present method does not repeatedly compare phylogenetic trees using sliding sequence windows. Instead, it efficiently performs node-wise comparisons of the trees using a one-time created database of strain distinguishing residues.

The program rapidly identified the calicivirus and the picornavirus strains. In typical runs on outbreak sequence data available from public databases, it took only 6 and 9 minutes of real time respectively to identify the strains of a number of norovirus sequences contained in 2 files. The first file had 71 complete norovirus capsid sequences while the second file had 117 norovirus capsid sequences of which 22 were complete sequences while the remaining 95 were partial sequences each of which was between 75 and 85 residues long. This indicates a real time (including I/O) of approximately 5 seconds per sequence. Comparable speeds were also observed in processing mixtures of calicivirus and picornavirus sequences when the "unknown" database option (Fig. 1b) was required. Complete benchmark details of such strain identification runs are available with the authors.

A significant computational advantage of using complete capsid sequences to construct the RECOVIR databases is that these databases are very robust with respect to the method used to construct the database trees. The overall topology of the trees, described by the relative branch lengths and the connectivity among the branches, remain essentially unchanged regardless of the neighbor-joining method used to draw the tree. For example, in all of the trees constructed using the complete capsid sequences, the sequence groups and the characteristic residues of the groups remain unchanged for partitions P2–P9. Only the P10 sequence groups show minor changes depending on the tree construction method used (data not shown but are available with the authors). Consequently, strain identification results using RECOVIR are very reliable.

In contrast, trees constructed using a partial capsid sequences or using amino acids of more conserved and relatively smaller non-capsid regions (such as the polymerase or the non-coding parts of the virus genomes) are relatively unstable because the topology of these trees critically depend upon the method of tree construction. Therefore, databases generated from such trees are not as reliable as the ones generated using trees of complete capsid sequences.

The partitioned residue databases in RECOVIR confer important advantages to this software over existing homology-based strain identification techniques. Because each partition and its nodes highlight branching orders that provide evolutionary context of sequence relationships, the databases in RECOVIR most likely indicate biologically significant residues as seen for the noroviruses [46].

More practically, searches using the database residues are able to show increasing similarities between the target sequences and appropriate database sequences as subsequent partitions are considered. Consequently, ambiguities in strain identification, such as those occurring due to the partial nature of the sequences or those due to possible mutations in the target sequences, may be resolved at least in the first few partitions where appropriately matched sequence groups are found. Such matches are not possible using conventional homology based sequence comparisons.

Residues of the RECOVIR databases may be used to determine relatedness among the corresponding sequence groups in a partition-dependent manner thus helping in virus classification. For example, while the database residues of P2 distinguish among two of the major norovirus groups and the murine norovirus group, the corresponding residues in subsequent partitions distinguish among the more closely related sequence groups. Such residue-based distinctions among the sequence groups provide useful ways to uniquely identify many tentatively classified sequences like those of the bovine and the alphatron noroviruses.

Our earlier analysis had indicated that the bovine noroviruses may belong to genogroup I while the alphatron-like sequences may belong to genogroup II [46]. One way to understand this is to compare the inter-group evolutionary distances, measured in evolutionary units (eu), between GI, GII and the groups containing the bovine and the alphatron sequences. These distances refer to the number of mutations among the sequences. In trees constructed using neighbor-joining methods and Gonnet-250 distance matrices, the GI-GII distance, calculated between the GI and the GII groups after excluding the bovine sequences, is 0.20919eu while the corresponding GI-bovine distance, calculated between the GI and the bovine groups, is 0.18303eu (Fig. 5). Thus, the difference between these distances (0.02586eu) is less than the partition width which results in clustering the bovine sequences together with the GI sequences in the same P2 group. Similarly, the GI-GII distance, calculated after excluding the alphatron-like sequences, is more than the GII-alphatron distance calculated between the GI and the alphatron groups thereby clustering the alphatron-like sequences in the GII group of partition P2. Thus, the phylogenetic distances are consistent with the proposal that the bovine sequences appear to share sequence similarities with the GI genogroup while the alphatron sequences share GII sequence features [46].

These relative phylogenetic distances among the sequence groups may be further understood by comparing the sequences of the P2 groups. On aligning the three different P2 sequence groups, the conserved residues among the bovine sequences show significant similarities with those of the GI sequences but not with the conserved sequences of GII. For example, the conserved GI residues 45T, 46A, 47G, 48Q, 49V and 50N (Norwalk virus numbering) are also conserved in the bovine sequences (Fig. 6). Similarly, the alphatron sequences show more residue similarities with the GII sequences than with those of GI (Fig. 6). Such similarities are further confirmed in subsequent partitions P3–P10 of RECOVIR. Therefore, the software places the bovine-like sequences in GI and the alphatron-like sequences in the GII group. Although these assignments are consistent with our earlier analysis [46], they do not agree with recent sequence clustering studies that assign independent genogroups to the bovine and the alphatron sequences [40]. Experimental determination of the possible biological roles of the conserved residues (Fig. 6) will help resolve the issue.

Many of the RECOVIR database residues are involved in putative capsid related functions in noroviruses [46]. This indicates that these residues and the corresponding sequence groups may critically identify subtle changes in capsid related functions among the genogroups, species and serotypes. Hence, RECOVIR may provide functional insights not currently available with current classification schemes of caliciviruses and picornaviruses.

A limitation of the present version of the software is that the implemented residue databases may become outdated given the rapidly increasing number of emerging strains of calici and picornaviruses. Therefore, the databases need to be regularly upgraded. However, upgrades should be carried out only when changes become statistically significant. In our experience, this would mean that at least 50% of the emerging sequences must show changes in the same residue locations before those residues are considered reliable enough to be incorporated in the databases.

A software package RECOVIR that can efficiently and accurately characterize strains of the highly divergent caliciviruses and picornaviruses is described. In contrast to current techniques that largely rely on sequence similarity scores to identify strains, RECOVIR implements a method indicated previously for noroviruses [46]. The software creates partitioned databases of capsid residues that unambiguously distinguish among a large number of calicivirus and the picornavirus reference strains. Using efficient tree-based search techniques, residues of target capsid sequences are compared with the residue databases to rapidly identify strains of the target sequences of all caliciviruses and picornaviruses. Strains of more than 300 complete and partial capsid sequences of calici and picornaviruses were successfully identified with an average time (real time including I/O) of approximately 5 seconds per sequence. The method is general enough to be applicable to the nucleotide sequences of calici and picornavirus capsids thereby providing rapid and powerful alternatives that complement current strain determination and classification techniques of these and other ssRNA viruses.

Project home pages: http://prion.bchs.uh.edu/recovir

Operating systems: Windows-XP and Linux

• Programming language: Perl and Java

• Other requirements: X-Windows support (such as Cygwin) is needed for remotely running the program under Linux environment.

The author(s) declares that there are no competing interests.

DZ coded the software and designed the GUI. SC developed the concept, designed the algorithm and its implementation and created the initial databases. SC and DZ performed extensive troubleshooting with both synthetic and real data. SC and GEF wrote the paper. All authors suggested improvements at different stages of manuscript preparation and read and approved the final version of the manuscript.