
RACE: Remote Analysis Computation for gene Expression data

DNA microarrays are standard tools in biology and medicine. An increasingly long list of applications includes the identification of gene expression changes associated with changes in cell state (1,2), classifying clinical samples based on the underlying pathological characteristics (3,4), drug development (5) and the functional annotation of genes (6). A typical microarray experiment might measure expression levels of tens of thousands of genes, and systematic variations introduced into the datasets (e.g. variations in labeling efficiencies or scanner settings) can often obscure the biological variation that is of real interest. Furthermore, once differentially expressed genes have been identified, inferring function based simply on their expression pattern can be both arduous and ineffective. Hence, bioinformatics tools that facilitate rigorous data analysis and interpretation are of the highest importance. Presented here is Remote Analysis Computation for gene Expression data (RACE), a web server which provides some solutions to these problems.

Microarray data analysis typically begins with data quality checks and data normalization (7). Once normalized expression levels are determined, expression ratios can be calculated and differentially expressed genes identified. At this stage of data analysis, the magnitude and significance of gene expression changes as well as the false discovery rate are important measures. Once a list of differentially expressed genes is identified, the task often turns to describing and interpreting the biological significance of the results. An often useful approach is to compile a list of the Gene Ontology (GO) terms associated with the differentially expressed genes (8). This provides an overview of the biological, physiological and cellular processes potentially involved in the biological phenomena and suggests directions for further studies.

A number of very useful web tools for microarray data analysis exist (e.g. 9–13). RACE contributes to the field by providing access to a wide range of quality checks, probe-level methods and state-of-the-art normalization techniques for Affymetrix raw data. To the best of our knowledge, these are not provided by any other publicly available server. Additionally, RACE provides tools to identify lists of differentially expressed genes and to determine and investigate the associated GO-term composition of those genes. To facilitate subsequent analyses and guarantee maximal transparency and reproducibility, the R script used to generate the results is provided.

RACE is divided into two components: the user interface and the analysis part. RACE uses basic authentication provided by Apache. HTTP communication is exclusively via port 80, making the system easily accessible through a firewall. Submitted jobs are queued and a customized analysis script is generated by a set of Perl scripts. The analysis script is executed in a subprocess.

All statistical analysis is performed using the free high-level interpreted statistical language R (R Core, 2004, http://www.R-project.org) and various Bioconductor packages (http://www.Bioconductor.org). The design of the software is modular to facilitate the addition of further analysis tools.

RACE can be used with an anonymous guest account but personal password-protected access is recommended. Registered users can store data in a personal account on the server, making it possible to run multiple tasks without the need to re-upload input files. Moreover, waiting times are avoided as the user is automatically emailed at the completion of a job. RACE creates for each job a directory for storing the input files, the selected parameters, the utilized R script and the results.

The upload files module allows users to upload and store files, decompress ZIP files and organize the data in different subdirectories. After setting the parameters in the analysis tools, the user is given the option of providing the input data either by a new upload or by copying or splitting previously uploaded or generated data.

The download files module allows users to access their password-protected directories, to browse their data and to download or delete files. Every file is deleted automatically by the system 1 week after its creation.

RACE currently offers three analysis tools accessible via the web interface, namely Data Quality Checks & Normalization, Statistical Tests and GO-term Analysis. Each tool is structured into three sections. The first section contains links to three help pages describing the purpose and implemented methods, the required input data format and the output files generated. Parameters which are required for the analysis are set in the second section. Parameters which can be optionally changed to customize the output files generated are set in the third section. At the bottom of the second section the user can provide the data to be analyzed.

After the submission of an analysis request, a confirmation message, including a link to the output page, is displayed. When the job is completed, authenticated users will receive the link to the output page by email. The output page contains the user data, the customized R script used for the analysis, all result files, ZIP archives and a log file which tracks job start and completion as well as problems that may have occurred during the run.

The Data Quality Checks & Normalization tool is dedicated to the visualization, quality checking and normalization of Affymetrix GeneChip data. Data should be provided as Affymetrix CEL files in ASCII format, optionally zipped.

The Data Quality Checks & Normalization tool uses primarily methods implemented in the Bioconductor packages ‘affy’ (14) and ‘affyPLM’. To quality check the perfect match (PM), probe levels are summarized in spatial and density plots. Individual probes in each probe set are numbered starting from the 5′ end of the transcript, and the mean 5′ to 3′ probe intensity bias for each array is determined. The probe-level intensities for probe sets are summarized to define a measure of the individual gene expression. To make data from different arrays comparable, RACE provides several normalization methods. The first of these is MAS 5.0, the current Affymetrix default algorithm. However, several studies (15,16) suggest that measures based only on the PM probes outperform the MAS 5.0 algorithm. For this reason RACE also provides access to two of the most prominent PM-based algorithms: RMA (Robust Multichip Average; 17) and gcRMA (see the Bioconductor website: http://www.Bioconductor.org). RMA includes quantile normalization and a robust multi-array probe-level fit, and gcRMA additionally exploits sequence information for the background adjustment. Based on the normalized expression values the Pearson correlation and the standard deviation of gene-wise expression differences between two arrays are calculated to evaluate similarities of the gene expression profile for each pair of samples. Moreover, a hierarchical sample cluster is built using Ward's minimum variance method.
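
The quantile normalization step mentioned above can be illustrated with a minimal sketch. This is not the RACE or Bioconductor implementation; it is a plain Python illustration of the general idea, in which every array is forced to share the same intensity distribution. The function name `quantile_normalize` and the toy intensities are hypothetical, and ties are broken by position, which is a simplification.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equally long columns (one per array).

    Each value is replaced by the mean of the values that hold the same
    rank across all arrays. Ties are broken by position (a simplification).
    """
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all arrays must have the same number of probes")

    # Sort every array and average across arrays at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(col[r] for col in sorted_cols) / len(columns) for r in range(n_rows)
    ]

    normalized = []
    for col in columns:
        # order[r] is the row index of the r-th smallest value in this array.
        order = sorted(range(n_rows), key=lambda i: col[i])
        out = [0.0] * n_rows
        for rank, row in enumerate(order):
            out[row] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy data: three arrays with four probes each (hypothetical values).
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
```

After normalization, all arrays contain the same set of values and differ only in which probe carries which value, so array-wide intensity differences no longer dominate comparisons between samples.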

The principal output of this tool is a file containing normalized gene expression levels. In addition, multiple data visualizations are provided to assist in judging the quality of the data and the success of the normalization.

Figure 1 shows two examples of the output type ‘PLM pseudo images’ (see Table 1) displaying the spatial distribution of the residuals obtained from a probe-level fit over multiple arrays. High-quality data have characteristics similar to Figure 1a, which shows only a few small defects. In general, small defects do not seriously bias the expression levels, since probes representing one gene are distributed across the array and robust summary methods are used. However, extensive regions with large residuals (the dark regions seen in Figure 1b) are a clear indication of an experimental artifact (e.g. in array production, hybridization or processing) and the array should be considered for exclusion from the analysis.

An example of the output type ‘Bias 5′ to 3′end plot’ is shown in Figure 2. Here, each line corresponds to an individual array. The graph is generated by calculating and plotting the array-wide mean intensities of ordered PM probe sets, where position 0 corresponds to the most 5′ probe and position 10 the most 3′ probe (the data are from an Affymetrix HGU133A array, whose probe sets each contain 11 PM probes). The slope and shape of each line are characteristic of each target sample and depend on the RNA sample source and the array type. When comparing expression data from a group of hybridizations, a sample whose slope and shape deviate significantly from the rest will often have anomalous ‘outlier’ results.
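
The calculation behind such a plot can be sketched as follows. This is an illustrative sketch only, not the code used by RACE: the function name `mean_intensity_by_position`, the probe-set identifiers and the intensities are hypothetical, and whether raw or log-transformed intensities are averaged is left open here.

```python
def mean_intensity_by_position(probe_sets):
    """Average PM intensity at each probe position across all probe sets.

    `probe_sets` maps a probe-set identifier to its PM intensities, ordered
    from the most 5' probe (position 0) to the most 3' probe.
    """
    sums, counts = [], []
    for intensities in probe_sets.values():
        for pos, value in enumerate(intensities):
            if pos == len(sums):
                # First time this position is seen: extend the accumulators.
                sums.append(0.0)
                counts.append(0)
            sums[pos] += value
            counts[pos] += 1
    return [s / c for s, c in zip(sums, counts)]


# Toy array with two probe sets of three PM probes each (hypothetical values;
# a real HGU133A probe set would contain 11 PM probes).
toy_array = {
    "set_a": [100.0, 120.0, 140.0],
    "set_b": [200.0, 220.0, 260.0],
}
profile = mean_intensity_by_position(toy_array)
print(profile)  # one mean per probe position, 5' to 3'
```

Computing one such profile per array and overlaying the resulting lines gives a plot of the kind described above, in which an array with a markedly different slope or shape stands out.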

Owing to space limitations, the content and purpose of all other output graphs can be only briefly summarized in Table 1.

The Statistical Tests tool identifies genes which are differentially expressed between two groups. The input files for this tool are two expression matrices provided as tab-delimited ASCII files. The first column of both files must contain unique gene identifiers and all other columns contain normalized expression values of the samples corresponding to the different groups. The input files can be generated on the server by splitting the output file ‘NormExprLevels.txt’ from the first tool into two groups.

The design of gene expression experiments can be represented in terms of a linear model (18). At the moment RACE supports designs where two groups are compared to identify genes changing expression across the groups. RACE uses the Bioconductor package ‘limma’ (http://bioinf.wehi.edu.au/limma/usersguide.pdf), which makes use of an empirical Bayesian approach, to fit the linear model. This approach outperforms a conventional t-test under conditions typical for microarray experiments (18–20). Owing to the large number of genes analyzed in a typical microarray experiment, an assessment of the effect of multiple testing is necessary. Therefore, we estimate from the distribution of raw p-values the fraction of the non-changing genes among all tested genes, as well as the false discovery rate (FDR) for each p-value threshold using the Bioconductor package ‘qvalue’ (21,22).
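
The idea behind these two estimates can be illustrated with a simplified sketch. It is not the ‘qvalue’ implementation: a single fixed tuning value `lam = 0.5` is assumed here in place of any smoothing over several values, and the function names and p-values are hypothetical. The fraction of non-changing genes is estimated from the p-values above `lam`, where mostly unchanged genes are expected, and the FDR at a threshold t is estimated as that fraction times m·t divided by the number of p-values not exceeding t.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the fraction of non-changing genes from p-values above `lam`."""
    m = len(p_values)
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def estimate_q_values(p_values, lam=0.5):
    """Estimated FDR obtained when each p-value is used as the threshold."""
    m = len(p_values)
    pi0 = estimate_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is the minimum
    # FDR estimate over all thresholds at least as large as its p-value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        fdr = pi0 * m * p_values[idx] / rank
        running_min = min(running_min, fdr)
        q_values[idx] = running_min
    return q_values


# Hypothetical raw p-values for ten genes.
p_values = [0.001, 0.004, 0.02, 0.03, 0.2, 0.45, 0.6, 0.7, 0.85, 0.95]
pi0 = estimate_pi0(p_values)
q_values = estimate_q_values(p_values)
```

In this toy example four of the ten p-values lie above 0.5, so the estimated fraction of non-changing genes is 0.8, and the estimated FDR grows monotonically with the p-value threshold.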

The principal outputs are lists of potentially differentially expressed genes chosen according to user-specified fold-change and p-value thresholds. A separate overview list containing all genes, complemented by statistical measures and additional gene annotations (e.g. GeneSymbol and LocusID), is also provided. RACE determines for each gene the fold-change, the logarithm of the fold-change (M), the mean expression level (A), the uncorrected p-value, the estimated FDR, the regularized t-value, the log odds ratio (B) and the standard deviations of the expression levels in each group. RACE provides multiple ways of visualizing these values. See Table 2 for an overview of the output graphs.

Figure 3 shows an example of the output type ‘p-Value histogram’ with an inset displaying the dependency between the FDR and the p-value. The p-value distribution is expected to be uniform if there are no differentially expressed genes. As the number of differentially expressed genes increases, the p-value distribution will show a more and more pronounced peak at small values. Figure 3 shows the output from a comparison of human testis and placenta RNA. A sharp, very high peak at small p-values is seen, indicating many highly significant expression differences between these two RNAs. By specifying a p-value and fold-change threshold, the user defines a candidate list of regulated genes. The inset shows the estimated FDR as a function of the p-value threshold.

Figure 4 shows the output type ‘MvA plot’ for the experiment in which human placenta and testis RNA were compared. Each point represents one gene. M is the log (base 2) of the fold-change in expression between testis and placenta, and A is the log average of the expression level. Very large expression differences over a wide range of expression intensities are seen. Genes which meet user-defined p-value and fold-change criteria are labeled in the output graph. Additionally, the selection criteria and the file name of the list which contains the labeled genes accompanied by annotations and statistical measures are presented in the output graph. In this example, the 100 most significant genes with fold-changes >2 (p < 0.01) have been selected.
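
The quantities plotted and the selection step can be sketched as follows. This is an illustration under stated assumptions rather than the RACE code: expression values are assumed to be on a log2 scale, A is taken as the average of the two group means, and the function names, gene names and numbers are hypothetical.

```python
import math


def m_and_a(group1, group2):
    """Return (M, A) for one gene from log2 expression values of two groups."""
    mean1 = sum(group1) / len(group1)
    mean2 = sum(group2) / len(group2)
    return mean1 - mean2, (mean1 + mean2) / 2.0


def select_genes(stats, max_p, min_fold_change, top_n=None):
    """Select genes by p-value and fold-change, most significant first.

    `stats` maps a gene identifier to a tuple (M, p_value).
    """
    min_abs_m = math.log2(min_fold_change)
    hits = [
        gene
        for gene, (m, p) in stats.items()
        if p < max_p and abs(m) > min_abs_m
    ]
    hits.sort(key=lambda gene: stats[gene][1])
    return hits if top_n is None else hits[:top_n]


# One hypothetical gene measured on two testis and two placenta arrays.
m_value, a_value = m_and_a([10.0, 11.0], [7.0, 8.0])
fold_change = 2.0 ** m_value

# Hypothetical (M, p-value) pairs for four genes.
stats = {
    "g1": (3.0, 0.0001),
    "g2": (0.5, 0.0005),
    "g3": (-2.0, 0.02),
    "g4": (-1.5, 0.003),
}
selected = select_genes(stats, max_p=0.01, min_fold_change=2.0, top_n=100)
```

With these thresholds, `g2` is excluded because its fold-change is too small and `g3` because its p-value is too large, which mirrors the way the labeled genes in the output graph are chosen.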

The aim of the GO-term Analysis tool is to assist in the biological interpretation of gene lists by identifying functional annotations (GO terms) which are enriched among the user-provided input genes. Users can choose among the different ontology categories and GO-term levels and can select threshold combinations for list coverage (minimum number of genes corresponding to each GO term) and statistical significance (p-value) for the overrepresentation of each GO term. GO terms which meet these criteria are reported together with the corresponding genes. A tab-delimited file containing Affymetrix identifiers in one column is required as input. Optionally, another column may contain log ratios, which can then be used to analyze the GO terms according to the under- or overexpression of the genes being analyzed. Gene lists generated by the Statistical Tests tool can be used directly as input files.

GO (23) provides three structured, controlled vocabularies (ontologies) that describe gene products species-independently in terms of their associated biological processes, cellular components and molecular functions. GO terms are organized in directed acyclic graphs, representing networks where each term may be a ‘child’ (more specialized term) of one or more ‘parents’ (less specialized terms). The networks define the ‘is a’ or ‘part of’ relationships between terms and allow the grouping of all GO terms into different levels. As the GO term level increases, the informational specificity increases and the genome coverage decreases (24; also see http://www.geneontology.org/ for a more detailed description).
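
How levels can be derived from such a graph is sketched below. The definition used here is an assumption and may differ from the one used by RACE: a term's level is taken as its shortest distance from a root term, which is assigned level 0. The term names are placeholders, not real GO identifiers.

```python
from collections import deque


def term_levels(parents):
    """Map each term to its shortest distance from a root term (level 0).

    `parents` maps a term to the list of its parent terms; a term with no
    parents is treated as a root.
    """
    children = {}
    for term, term_parents in parents.items():
        children.setdefault(term, [])
        for parent in term_parents:
            children.setdefault(parent, []).append(term)

    roots = [term for term in children if not parents.get(term)]
    levels = {root: 0 for root in roots}
    queue = deque(roots)
    # Breadth-first traversal: the first visit to a term gives its
    # shortest distance from a root.
    while queue:
        term = queue.popleft()
        for child in children[term]:
            if child not in levels:
                levels[child] = levels[term] + 1
                queue.append(child)
    return levels


# Placeholder graph in which term "d" has two parents at different depths.
toy_parents = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["c", "b"],
}
levels = term_levels(toy_parents)
```

Because a term may have several parents, its level depends on the convention chosen. Here `d` is reached through `b` in two steps and through `c` in three, and the shorter path determines its level.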

RACE uses the Bioconductor meta-data packages for the mappings of Affymetrix identifiers to LocusLink identifiers and of LocusLink identifiers to GO terms. GO-term levels are derived from the ‘gene_ontology.obo’ text file provided by the Gene Ontology Consortium. Based on the GO-term composition of all genes on the array used, a p-value is determined using a hypergeometric distribution for the overrepresentation of each GO term among the specified gene list. The ‘GOstats’ Bioconductor package was used to implement this method. For more information, see http://Bioconductor.org/Docs/Papers/2003/Compendium/GOstats.pdf.
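
The hypergeometric test itself can be written down compactly. The following is a minimal sketch of the calculation, not the GOstats code; the function and parameter names and the counts are hypothetical. The p-value is the probability of finding at least the observed number of annotated genes in a list drawn at random, without replacement, from all genes on the array.

```python
from math import comb


def go_enrichment_p_value(chip_total, chip_with_term, list_total, list_with_term):
    """Upper-tail hypergeometric probability for one GO term.

    chip_total      -- number of annotated genes on the array
    chip_with_term  -- genes on the array annotated with the GO term
    list_total      -- number of genes in the user-provided list
    list_with_term  -- genes in the list annotated with the GO term
    """
    upper = min(chip_with_term, list_total)
    # Count the ways of drawing k annotated and (list_total - k) other genes
    # for every k at least as large as the observed overlap.
    favorable = sum(
        comb(chip_with_term, k) * comb(chip_total - chip_with_term, list_total - k)
        for k in range(list_with_term, upper + 1)
    )
    return favorable / comb(chip_total, list_total)


# Hypothetical counts: 5 of 20 genes on the array carry the term,
# and 2 of the 4 genes in the list carry it.
p_value = go_enrichment_p_value(20, 5, 4, 2)
```

A small p-value indicates that the term occurs in the gene list more often than expected from its frequency on the array, which is the basis for the ranking of enriched GO terms.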

According to the user-specified parameters (GO-term type, GO-term specificity level, minimum number of genes annotated with a certain GO term, p-value threshold), a list of enriched GO terms is generated for the genes provided. For each enriched GO term, the numbers of supporting genes from the list as well as from the entire chip are reported and visualized. The counts of annotated and unannotated genes are reported as well. If the gene list corresponds to differentially expressed genes which are supplied with log ratios, the numbers of over- and underexpressed genes among the regulated genes are presented. To generate a ranking based on statistical significance, a p-value is calculated for the overrepresentation of GO terms based on the hypergeometric distribution. The results are summarized in bar graphs and tables.

Figure 5 shows such a GO-term bar chart for the experiment comparing human placenta and testis gene expression patterns. Different colors are assigned to up- and downregulated genes. The number of GO terms in the ‘biological function’ category significantly enriched in the group of differentially expressed genes is presented. Not surprisingly considering the source of the RNAs, the biological function GO terms ‘spermatogenesis’ (overexpressed in testis) and ‘pregnancy’ (underexpressed in testis) dominate the list.

RACE offers an easy-to-use collection of bioinformatics web tools to analyze DNA microarray data, without requiring any installation or maintenance on the user side. By using various R subroutines and Bioconductor packages, RACE provides users with access to powerful statistical analysis tools without the need for specific expertise in their use. It offers different users or laboratories the possibility of performing data QC, normalization and analysis in a standardized way, which is likely to lead to more consistent and reproducible results.