
Analyzing the simplicial decomposition of spatial protein structures

After a rigorous, graph-theoretical re-structuring of the whole PDB, we created the RS-PDB (Rich-Structure PDB) database. Using this cleaned and repaired database, we defined simplicial complexes on the heavy atoms of the PDB and analyzed the geometric properties of the tetrahedra.

We found surprisingly characteristic differences between simplices whose atomic vertices are of different types, and between the atomic neighborhoods (also described by simplices) of different ligand atoms in proteins.

The information stored in the Protein Data Bank [1] would make fully automated in silico studies possible if mislabeled chemical groups, broken protein and nucleic acid chains, and other errors were corrected. Even today, newly submitted data is verified "by hand" by human experts. In an earlier work, we applied a rigorous cleaning and re-structuring procedure to the entries in the Protein Data Bank [2], and created the RS-PDB database. We made use of non-trivial mathematical algorithms, mainly graph algorithms: computing the InChI code [3,4] involved graph-isomorphism testing; transforming aromatic notation to Kekulé notation used a non-bipartite graph-matching algorithm [5]; breadth-first-search graph traversals [6] were used throughout the work [2]; depth-first search [6] was used in building the ligand molecules and identifying ring structures; kd-trees [7] were applied for computing covalent bonds; and hashing [6] was utilized for the fast generation of protein-sequence IDs.

The resulting RS-PDB database is capable of serving intricate structural queries on all the three-dimensional protein structures known to mankind.

It is of basic importance to map the physico-chemical properties of protein-ligand binding sites, most importantly the Coulomb and Van der Waals forces, in order to predict protein-ligand binding, to design ligands for a given binding site on the surface of a protein, or to design inhibitors or activators of enzymatic mechanisms. The exact description of the forces in question is a deep quantum-chemical problem. The atomic environment of a binding site clearly has a strong effect on these forces; consequently, examining the atomic environments of the ligands in the crystallographically verified protein-ligand complexes in the PDB would yield insight into binding mechanisms and the design of biologically active molecules. The first step in this direction needs to be the analysis of the simplicial structures of the atoms that form the protein structures themselves. The second step is the analysis of the simplicial neighborhoods of the ligand atoms.

In the present work we define a certain simplicial decomposition on the heavy atoms of the protein structures in the PDB, and analyze some geometrical properties of the tetrahedra of different atomic composition. In this way we succeeded, for the first time in the literature, in defining a structure capable of answering topological questions concerning the distribution of the volume and shape of heavy protein atoms in the whole PDB. One of our main results is the identification of the volume-shape relation of tetrahedra of distinct atomic composition.

Even the refined, cleaned RS-PDB database [2] lacks important features, such as the easy handling of queries like the following: What atoms surround a certain (ligand or protein) atom in the structure? Which atoms neighbour the atom or amino acid X in the protein? How many ligand atoms are surrounded by exactly the tetrahedron with C-C-C-O atoms in its vertices? How frequent are the tetrahedra with vertices C-C-O-N? Are there differences in the shape of tetrahedra of different composition?

Note that such queries cannot be answered from the amino acid sequence of the protein, since they intrinsically depend on the tertiary structure of the protein. Consequently, one needs to use some cleaned version of the PDB as the initial data.

We chose the Delaunay decomposition for the discretization of the dataset in the RS-PDB database because in this "tessellation" the tetrahedra are close to regular ones, and because it is a natural and well-defined notion with a well-known algorithm for generating the tessellation.

Definition 1 Given a finite set of points $A \subset \mathbb{R}^3$ and a subset $H \subseteq A$ such that the points of $H$ are on the surface of a sphere and the sphere does not contain any further points of $A$, the convex hull of $H$ is called a Delaunay region.

Delaunay regions define a partition of the convex hull of $A$. If the points of $A$ are in general position (i.e., no five of the points are on the surface of a sphere), then all regions are tetrahedra.
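Definition 1 can be turned directly into a brute-force procedure, sketched below under the assumption that the points are in general position. The function names `det3`, `circumsphere` and `delaunay_tetrahedra` are hypothetical and chosen only for this illustration. The sketch examines every 4-subset of the points, so it is suitable only for a handful of points, and it is not the qhull implementation used in this work.

```python
from itertools import combinations


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def circumsphere(p0, p1, p2, p3):
    """Return (center, squared radius) of the sphere through four points,
    or None if the points are (numerically) coplanar."""
    others = (p1, p2, p3)
    # |c - pi|^2 = |c - p0|^2  <=>  2 (pi - p0) . c = |pi|^2 - |p0|^2
    rows = [[2.0 * (p[k] - p0[k]) for k in range(3)] for p in others]
    rhs = [sum(p[k] ** 2 - p0[k] ** 2 for k in range(3)) for p in others]
    d = det3(rows)
    if abs(d) < 1e-12:
        return None
    center = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / d)
    r2 = sum((center[k] - p0[k]) ** 2 for k in range(3))
    return center, r2


def delaunay_tetrahedra(points, eps=1e-9):
    """Index 4-tuples whose circumsphere contains no further point.
    Assumes general position (no five points on a common sphere)."""
    result = []
    for idx in combinations(range(len(points)), 4):
        sphere = circumsphere(*(points[i] for i in idx))
        if sphere is None:
            continue
        center, r2 = sphere
        if all(sum((points[j][k] - center[k]) ** 2 for k in range(3)) > r2 - eps
               for j in range(len(points)) if j not in idx):
            result.append(idx)
    return result
```

For the five points (0,0,0), (1,0,0), (0,1,0), (0,0,1) and (1,1,1.5), the sketch returns the two tetrahedra with vertex indices (0, 1, 2, 3) and (1, 2, 3, 4).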

Singh, Tropsha and Vaisman [8] applied the Delaunay decomposition to protein structures as follows: they selected $A$ to be the set of Cα atoms of the protein, and analyzed the relationship between the volume and "tetrahedrality" of the Delaunay regions and the amino acid order, in order to predict secondary protein structure.

They gave the following definition:

Definition 2 ([8]) The tetrahedrality of the tetrahedron with edge lengths $\ell_1, \ell_2, \ell_3, \ell_4, \ell_5, \ell_6$ is defined as

$$\frac{4}{\left(\sum_{k} \ell_k\right)^{2}} \sum_{i<j} \frac{(\ell_i - \ell_j)^2}{15},$$

where $\ell_i$ is the length of edge $i$.

Note that the tetrahedrality of the regular tetrahedron is 0.
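The two quantities analyzed in this work, tetrahedrality and volume, can be computed from the four vertex coordinates as in the following minimal sketch. It reads the formula of Definition 2 as 4 Σ_{i<j} (ℓ_i − ℓ_j)² / (15 (Σ_k ℓ_k)²); the normalizing constant follows that reading and is an assumption of the sketch, as are the function names.

```python
from itertools import combinations
from math import dist


def edge_lengths(tet):
    """The six edge lengths of a tetrahedron given as four (x, y, z) points."""
    return [dist(a, b) for a, b in combinations(tet, 2)]


def tetrahedrality(tet):
    """Scale-invariant spread of the edge lengths; 0 for a regular tetrahedron.
    The constant 4/15 follows the assumed reading of Definition 2."""
    lengths = edge_lengths(tet)
    spread = sum((a - b) ** 2 for a, b in combinations(lengths, 2))
    return 4.0 * spread / (15.0 * sum(lengths) ** 2)


def volume(tet):
    """Volume as |u . (v x w)| / 6 for the edge vectors u, v, w from vertex 0."""
    p0, p1, p2, p3 = tet
    u, v, w = ([p[k] - p0[k] for k in range(3)] for p in (p1, p2, p3))
    cross = (v[1] * w[2] - v[2] * w[1],
             v[2] * w[0] - v[0] * w[2],
             v[0] * w[1] - v[1] * w[0])
    return abs(sum(u[k] * cross[k] for k in range(3))) / 6.0
```

Because the expression is a ratio of squared lengths, scaling all coordinates by a common factor leaves the tetrahedrality unchanged while the volume scales with the cube of the factor, which is why the two variables carry complementary information.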

In what follows, $A \subset \mathbb{R}^3$ is always a subset of the atoms of a protein, preferably the heavy atoms (i.e., non-hydrogen atoms) or just the Cα atoms.

To find the Delaunay decomposition of a set, the qhull algorithm was used (the implementation source is available at http://www.qhull.org/ [9]).

Our complete test set was selected from the RS-PDB by the following criteria: the entry needs to contain at least one protein with no missing atoms, and the resolution of the structure has to be at least 2.2 Å. We found 5,757 such entries in the RS-PDB database.

Figure 1 shows the decomposition for the PDB entry 10gs.

In contrast with the article [8], we took $A$ to be the set of heavy atoms of the 5,757 proteins. Note that in this case we cannot assume that the points are in general position: for example, in a (perfect) benzene ring all 6 carbon atoms lie on a sphere. However, we found that all regions are tetrahedra, probably due to both the imprecision of the data in the PDB and minor perturbations in atomic positions. In our test, instead of examining the distributions of the volume and the tetrahedrality of the regions separately, we created density maps in both variables at the same time. The triple logarithmic plot can be seen in Figure 2. It is quite straightforward to see that at the boundary of the protein the tetrahedra tend to be more irregular and of larger volume, while in the interior of the protein the tetrahedra are small, compact, and regular (see Figure 1). However, the more intricate analysis depicted in Figure 2 shows a distinctly characteristic distribution. One of our main results is the identification of regions of the plot of Figure 2 that are strictly characteristic of the vertex composition of the tetrahedra involved.
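A joint density map of this kind can be sketched as a two-dimensional histogram over logarithmic axes whose bin counts are themselves shown on a logarithmic scale. The following illustration assumes that this is what "triple logarithmic" refers to; the bin width and the function name `log_density_map` are hypothetical choices, not parameters taken from this work.

```python
from collections import Counter
from math import floor, log10


def log_density_map(samples, bin_width=0.25):
    """Bin (volume, tetrahedrality) pairs on log10 axes and return
    {(volume_bin, tetrahedrality_bin): log10(count)}.
    bin_width is in log10 units and is an arbitrary illustrative choice."""
    counts = Counter()
    for vol, tet in samples:
        if vol <= 0 or tet <= 0:
            # log10 is undefined here, e.g. tetrahedrality 0 for a
            # perfectly regular tetrahedron; such samples are skipped.
            continue
        key = (floor(log10(vol) / bin_width), floor(log10(tet) / bin_width))
        counts[key] += 1
    return {key: log10(n) for key, n in counts.items()}
```

Each key identifies a cell of the plot; restricting `samples` to the tetrahedra of one vertex composition before binning gives one map per composition on the same grid.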

After that we examined the tetrahedra grouped according to the set of atoms in their vertices. Every tetrahedron was assigned a label formed by merging, in alphabetic order, the 4 symbols of the elements in its corners. (For example, a tetrahedron spanned by a nitrogen, two carbon atoms and an oxygen would be assigned the symbol C_C_N_O_.) Grouped by these labels, the counts of the tetrahedra are listed in Table 1.
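The labeling and counting step can be illustrated with a short sketch. The label format, with an underscore after every symbol, follows the example C_C_N_O_ given above; the function names are hypothetical.

```python
from collections import Counter


def composition_label(elements):
    """Merge the four element symbols in alphabetic order,
    each followed by an underscore, e.g. 'C_C_N_O_'."""
    return "".join(symbol + "_" for symbol in sorted(elements))


def count_compositions(tetrahedra_elements):
    """Count tetrahedra per label; each item holds the element
    symbols of the four vertices of one tetrahedron."""
    return Counter(composition_label(e) for e in tetrahedra_elements)
```

For instance, `composition_label(["N", "C", "O", "C"])` gives `"C_C_N_O_"`, so the order in which the vertices are listed does not affect the grouping.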

We observed that splitting the density plot according to the composition of the vertex sets of the Delaunay tetrahedra shows different patterns for different labels. This is one of our main results, depicted in Figure 3.

Here we analyze the atomic environments of ligand atoms bound to proteins. The atomic environment of each ligand atom is identified with the vertices of the tetrahedron that contains the atom of the bound ligand, in a tetrahedral decomposition of the heavy atoms of the protein.

With this approach we can describe the environment of ligand atoms in proteins uniformly and in a discrete manner. The classification is given by describing the tetrahedra according to the atoms in their vertices, and by the ligand atoms that the convex hulls of these tetrahedra contain (Figure 4). One of our main results is the statistical analysis of the frequencies of the separate ligand atoms in different types of tetrahedra formed from protein atoms, given in Table 2 and Table 3.
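Locating the tetrahedron that contains a ligand atom amounts to a point-in-tetrahedron test. The following minimal sketch uses signed volumes (equivalently, the signs of the barycentric coordinates) and a linear scan over the tetrahedra; the data layout, the function names and the linear scan are assumptions made for illustration, not a description of the implementation used in this work.

```python
def signed_volume6(a, b, c, d):
    """Six times the signed volume of tetrahedron (a, b, c, d)."""
    u, v, w = ([q[k] - a[k] for k in range(3)] for q in (b, c, d))
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


def contains(tet, p, eps=1e-12):
    """True if p lies inside or on the boundary of tet: replacing any one
    vertex by p must not flip the orientation of the tetrahedron."""
    a, b, c, d = tet
    total = signed_volume6(a, b, c, d)
    if abs(total) <= eps:
        return False  # degenerate (flat) tetrahedron
    parts = (signed_volume6(p, b, c, d), signed_volume6(a, p, c, d),
             signed_volume6(a, b, p, d), signed_volume6(a, b, c, p))
    return all(part * total >= -eps for part in parts)


def ligand_environment(ligand_xyz, protein_atoms, tetrahedra):
    """Label of the first tetrahedron containing the ligand atom, or None.
    protein_atoms: list of (element, (x, y, z)); tetrahedra: index 4-tuples."""
    for idx in tetrahedra:
        if contains([protein_atoms[i][1] for i in idx], ligand_xyz):
            symbols = sorted(protein_atoms[i][0] for i in idx)
            return "".join(s + "_" for s in symbols)
    return None
```

A ligand atom outside the convex hull of the protein atoms lies in no tetrahedron, in which case the sketch returns `None`.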

We use the ligand-identification technique described in [2], with the classification of monomer IDs given in [10] and [11]. Concisely, we double-checked whether a ligand, even one with more than one monomer ID, is a single molecule or not, by comparing the bond tables from mmCIF with the atomic distances. A ligand was discarded if it was recognized as a crystallization artifact, a covalently bound (but non-protein) molecule, or a junk molecule [10].

In this work we prepared the simplicial decomposition of 5,757 protein structures, chosen from the Protein Data Bank by quality criteria: every atom has coordinates (i.e., there are no missing atoms) and the resolution of the structure is at least 2.2 Å. The heavy atoms (that is, non-hydrogen atoms) of the structures were decomposed into Delaunay regions using the qhull algorithm [9]. Next we depicted the tetrahedrality/volume relation in a triple logarithmic plot (Figure 2), and also counted the tetrahedra of different vertex sets in Table 1. We found that tetrahedra with different atoms in their vertices populate different areas of the plot of Figure 2; Figure 3 gives our results. Figure 3 shows that data points corresponding to tetrahedra of a given atomic composition assume well-characterizable positions in Figure 2. This result shows the spatial preferences of tetrahedra of distinct composition in protein structures. Further exploration of this avenue may yield methods that help in silico protein folding studies. We also used the RS-PDB database [2] for finding crystallographically verified ligands in our test set of 5,757 proteins. Next, the tetrahedra containing the atoms of these ligands were collected and given in Tables 2 and 3. We believe that these large-scale data will help in identifying ligand-binding preferences in silico, in inhibitor design and in ligand binding prediction.

The authors declare that they have no competing interests.

Rafael Ördög designed and prepared the simplicial database, analyzed it with the triple-logarithmic plots of Figure 2 and Figure 3, and analyzed the data of tetrahedra of different atomic types and ligands. Zoltán Szabadka designed and prepared the RS-PDB database, including the cleaning methods, and helped with the discretization. Vince Grolmusz initiated the simplicial decomposition of the protein spatial data, led the work and wrote the paper.