
HCAD, closing the gap between breakpoints and genes

Recurrent chromosome aberrations are an important resource for associating human pathologies with specific genes. However, for technical reasons, a large number of chromosome breakpoints are defined only at the level of cytobands, and many of the genes involved remain unidentified. We developed a web-based information system that mines the scientific literature and generates comprehensive textual information on all human breakpoints. We show that the statistical analysis of this textual information, combined with genomic data, can identify genes directly involved in DNA rearrangements. The Human Chromosome Aberration Database (HCAD) is publicly accessible at http://www.pdg.cnb.uam.es/UniPub/HCAD/.

In model systems, identifying and generating mutations is the usual genetic approach to understanding the function of individual genes. In humans, natural mutations, such as chromosome aberrations, are a comparable resource for genetic research, since DNA breakage and reciprocal recombination often lead to the fusion or deregulation of genes (1–3). Indeed, most human cancers (both leukaemias and solid tumours) and congenital disorders (including dysmorphology syndromes) display recurrent chromosome abnormalities. So far, the Mitelman database (http://cgap.nci.nih.gov/Chromosomes/Mitelman) constitutes the main effort to collect clinical and morphological data on cancer-related chromosome aberrations (4). However, the Mitelman database contains relatively little molecular information and depends completely on manual curation.

We employ automatic text mining methods on PubMed abstracts to gather molecular and clinical facts for all recurrent and non-recurrent breakpoints described in human disorders. The complete and automatic identification of relevant abstracts from 10 million publications is possible, since aberration codes, for instance t(9;22)(q34;q11.2), are unambiguous. To facilitate the literature exploration of specific breakpoints, relevant information is compiled into textual and comprehensive overviews (the database is updated every 1–3 months).
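Because aberration codes follow a rigid notation, they can be picked out of free text with a simple pattern. The following is a minimal sketch of that idea, not the pipeline used by HCAD: the regular expression, the function name `find_aberrations` and the restriction to translocations (`t`) and inversions (`inv`) are assumptions made for illustration.

```python
import re

# Assumed pattern for short translocation/inversion codes such as
# t(9;22)(q34;q11.2) or inv(16)(p13q22). Real nomenclature is richer.
ABERRATION_RE = re.compile(
    r"\b(t|inv)"                                # aberration type
    r"\(([0-9XY]{1,2}(?:;[0-9XY]{1,2})*)\)"     # chromosomes, e.g. (9;22)
    r"\(([pq][0-9.]+(?:;?[pq][0-9.]+)*)\)"      # bands, e.g. (q34;q11.2)
)


def find_aberrations(text):
    """Return (code, breakpoints) pairs found in the text.

    Breakpoints are strings such as '9q34', built by pairing each
    chromosome with its band.
    """
    results = []
    for match in ABERRATION_RE.finditer(text):
        _kind, chroms, bands = match.groups()
        chrom_list = chroms.split(";")
        band_list = re.findall(r"[pq][0-9.]+", bands)
        if len(chrom_list) == len(band_list):
            # One band per chromosome, as in a reciprocal translocation.
            breakpoints = [c + b for c, b in zip(chrom_list, band_list)]
        elif len(chrom_list) == 1:
            # Several bands on one chromosome, as in an inversion.
            breakpoints = [chrom_list[0] + b for b in band_list]
        else:
            # Ambiguous pairing; skip rather than guess.
            continue
        results.append((match.group(0), breakpoints))
    return results


abstract = ("The t(9;22)(q34;q11.2) translocation and the "
            "inv(16)(p13q22) inversion were both observed.")
for code, breakpoints in find_aberrations(abstract):
    print(code, breakpoints)
```

An abstract is then treated as relevant to a breakpoint whenever one of the extracted breakpoint strings matches it.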

For every breakpoint, we calculated statistically significant genes (5,6) and biomedical terms (e.g. disease names) that were mapped back onto their source sentences. In the web-based interface these genes and keywords serve as hyperlinks between information-rich sentences (see Figure 1). In this manner, the information for breakpoints becomes accessible as a navigable network that is intuitive and exhibits all the advantages of the Internet (7). As researchers can move between sentences taken directly from source abstracts, they can always retain control over the reliability and significance of the information. Diseases and associative verbs are also highlighted and hyperlinked within the text to make associations with human pathologies easier to recognize. Additional molecular information on breakpoints is accessible through links to external databases, such as GenBank, LocusLink and OMIM. The complete system, called HCAD (Human Chromosome Aberration Database), contains 737 breakpoints and 861 literature-associated genes from 2082 cytogenetically different translocations and inversions (see Figure 2).
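The sentence network described above can be pictured as an inverted index from terms to the sentences that mention them. The sketch below illustrates that data structure only; the function names, the whole-word matching rule and the example sentences are assumptions, not a description of the HCAD implementation.

```python
import re
from collections import defaultdict


def build_term_index(sentences, terms):
    """Map each term to the indices of the sentences that mention it."""
    index = defaultdict(list)
    for i, sentence in enumerate(sentences):
        for term in terms:
            # Whole-word, case-insensitive match (an assumed rule).
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                index[term].append(i)
    return dict(index)


def linked_sentences(index, sentence_id):
    """For one sentence, list the other sentences reachable via each shared term."""
    links = {}
    for term, ids in index.items():
        if sentence_id in ids:
            others = [j for j in ids if j != sentence_id]
            if others:
                links[term] = others
    return links


# Invented example sentences, for illustration only.
sentences = [
    "BCR-ABL1 fusion results from t(9;22).",
    "ABL1 encodes a tyrosine kinase.",
    "Chronic myeloid leukaemia carries the BCR rearrangement.",
]
index = build_term_index(sentences, ["ABL1", "BCR", "leukaemia"])
print(linked_sentences(index, 0))
```

Each key of the returned dictionary corresponds to a hyperlinked term in a sentence, and its value lists the sentences a reader could jump to from there.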

The HCAD system was also designed to assist the identification of potential breakpoint genes. This is a difficult task even though the complete human genome is now known, because of the sheer number of genes per cytoband (8). The premise behind HCAD is that genes directly affected by recurrent breakage events will be quoted more often in abstracts about the corresponding breakpoint, even if direct proof of this association has not yet been described (see Figure 3). The statistical analysis in HCAD thus provides probabilities for genes to be relevant for a certain breakpoint (literature evidence). False positive associations of these predicted genes are eliminated by crosschecking their localization with genomic data (9). We found that 343 of the 861 literature-associated genes localize to recurrent breakpoints. Indeed, for one-third of these there is already clear experimental evidence that they are involved in fusion events. We believe that the HCAD information system provides a reliable basis for uncovering the role (10) of the remaining human genes in the context of chromosomal aberrations.
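The two-step logic of literature evidence followed by a genomic crosscheck can be sketched as follows. The text does not specify which statistic HCAD uses, so the hypergeometric tail probability here is an assumed stand-in for "quoted more often than expected". All function names, the `alpha` threshold, the prefix-based band match and the counts in the example are likewise hypothetical.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k): chance that at least k of the n breakpoint abstracts
    mention a gene, given that K of all N abstracts mention it."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / comb(N, n)


def candidate_genes(co_mentions, gene_totals, n_breakpoint, n_total,
                    gene_band, breakpoint_band, alpha=0.01):
    """Keep genes that are over-represented in the breakpoint's abstracts
    AND map to the breakpoint's cytoband."""
    hits = []
    for gene, k in co_mentions.items():
        p = hypergeom_tail(k, gene_totals[gene], n_breakpoint, n_total)
        # Genomic crosscheck: the gene's band must lie within the breakpoint band.
        located = gene_band.get(gene, "").startswith(breakpoint_band)
        if p < alpha and located:
            hits.append((gene, p))
    return sorted(hits, key=lambda item: item[1])


# Invented counts: 1000 abstracts in total, 50 of them about breakpoint 9q34.
co_mentions = {"ABL1": 10, "BCR": 9, "TP53": 2}     # mentions within the 50
gene_totals = {"ABL1": 20, "BCR": 15, "TP53": 200}  # mentions within the 1000
gene_band = {"ABL1": "9q34.12", "BCR": "22q11.23", "TP53": "17p13.1"}

for gene, p in candidate_genes(co_mentions, gene_totals, 50, 1000,
                               gene_band, "9q34"):
    print(gene, p)
```

In this toy example BCR is strongly over-represented in the 9q34 abstracts but is dropped by the location check, which mirrors how the genomic crosscheck removes false positives from the literature evidence.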