Construction of the training and test sets

The Protein Data Bank structures, grouped into clusters comprising entries that share at least 50% sequence identity, were obtained from https://www.rcsb.org/pages/download/ftp (bc-50.out file). Structures longer than 700 residues or shorter than 30 residues, solved at a resolution worse than 2.5 Å, or solved by methods other than X-ray crystallography were discarded. This yielded a set comprising 205,527 structures grouped within 24,276 clusters. A secondary structure state was assigned to each residue of each structure using DSSP7. Considering the weak performance of DSSP in detecting π-helices, we used a custom π-helix assignment method implemented based on ref.1. First, all “I” labels (π-helices) defined by DSSP were substituted with “C” (coils). Next, residue ranges were re-marked as π-helical if (i) at least two π-type hydrogen bonds were present according to DSSP (in cases where DSSP indicated two alternative hydrogen bonds, e.g. i → i + 4 and i → i + 5, the one with the lower energy was considered), (ii) at least one of the π-type hydrogen bonds had an energy equal to or lower than −2.0 kcal/mol, and (iii) the torsion angles of all residues in the range fell into the broadly defined helical region (−180° < φ < 0°, −120° < ψ < 45°). Finally, the 8-state assignment was reduced to a 4-state assignment (“H” – helix, containing “G” and “H” labels; “S” – strand, containing “E” and “B” labels; “C” – coil, containing “S”, “T”, and “C” labels; and “I” – π-helix). An analogous approach was used to generate an independent assignment of π/α-bulges, i.e. secondary structure elements comprising just a single i → i + 5 hydrogen bond. Such six-residue-long regions were defined based on the presence of one π-type hydrogen bond with an energy equal to or lower than −2.0 kcal/mol and torsion angles fulfilling the criteria listed above.
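As a minimal sketch, the 8-to-4-state label reduction and the torsion-angle criterion (iii) above can be expressed as follows. This is an illustration, not the authors' code; the hydrogen-bond-based re-marking of π-helical ranges is assumed to have been applied already.

```python
# Reduce a DSSP 8-state string to the 4-state alphabet used here.
# Note: DSSP "S" (bend) maps to coil "C", while the output "S" means strand.
EIGHT_TO_FOUR = {
    "G": "H", "H": "H",                       # 3-10 and alpha-helices -> helix
    "E": "S", "B": "S",                       # strands and bridges -> strand
    "S": "C", "T": "C", "C": "C", "-": "C",   # bends, turns, coil -> coil
    "I": "I",                                 # (re-marked) pi-helix kept as its own class
}

def reduce_to_four_state(dssp8: str) -> str:
    return "".join(EIGHT_TO_FOUR.get(label, "C") for label in dssp8)

def in_helical_region(phi: float, psi: float) -> bool:
    # Broadly defined helical region used in criterion (iii)
    return -180.0 < phi < 0.0 and -120.0 < psi < 45.0
```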

To build a dataset in which the pairwise sequence identity does not exceed 50%, we selected a representative structure from each of the aforementioned 24,276 clusters using the following procedure. First, from each cluster that does not contain any structure with a π-helix and/or a π/α-bulge, the single structure with the best (numerically lowest) resolution was selected as the representative. From the remaining clusters, representative structures with the best resolution and the longest π-helical segment were selected; however, those that also contained π/α-bulges were discarded. This procedure resulted in an initial dataset comprising 22,510 structures, of which 2,985 contained at least one canonical π-helix and no π/α-bulges (positive examples set), and 19,525 contained neither π-helices nor π/α-bulges (negative examples set). From the remaining 1,750 structures, those containing at least one π/α-bulge but no π-helices were selected, filtered to 30% sequence identity, and used to generate a separate “π/α-bulge” set comprising 449 structures.
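A hypothetical sketch of the per-cluster representative selection described above. The field names (`resolution`, `has_pi_helix`, `has_bulge`, `longest_pi_segment`) are illustrative, not from the paper, and the tie-breaking order (resolution first, then segment length) is an assumption.

```python
def pick_representative(members):
    """Pick one representative structure from a cluster, or None.

    `members` is a list of dicts describing the cluster's structures.
    """
    with_pi = [m for m in members if m["has_pi_helix"] or m["has_bulge"]]
    if not with_pi:
        # Cluster without pi-helices or bulges: best (lowest) resolution wins.
        return min(members, key=lambda m: m["resolution"])
    # Cluster with pi-helices: prefer best resolution, then the longest
    # pi-helical segment; structures that also contain bulges are discarded.
    candidates = [m for m in with_pi if m["has_pi_helix"] and not m["has_bulge"]]
    if not candidates:
        return None  # cluster contributes no representative to this set
    return min(candidates,
               key=lambda m: (m["resolution"], -m["longest_pi_segment"]))
```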

Sequences corresponding to the 22,959 structures (2,985 in the positive examples set, 19,525 in the negative examples set, and 449 in the “π/α-bulge” set) were used as queries in PSI-BLAST29 searches (E-value < 0.001, three iterations) of the NCBI non-redundant protein sequence database filtered to 90% sequence identity, in order to calculate position-specific scoring matrices (PSSMs). Subsequently, the 22,510 structures of the positive and negative examples sets were merged, shuffled, and randomly assigned to the training and test sets, comprising 20,295 and 2,215 sequences, respectively. Importantly, we ensured that all sequences of the test set show no more than 30% similarity to any sequence of the training set. Furthermore, we also ensured that the two sets contained an equal percentage of π-helical residues, which amounted to 0.46% of all residues in each set. Detailed statistics for all data sets are shown in Supplementary Table 1.

Sequence encoding

For encoding sequences and their corresponding PSSM profiles, we used a procedure in which a sequence is encoded as a 700 × 40 matrix, where 700 and 40 correspond to the maximal sequence length and the number of features associated with each residue, respectively. Of the 40 features, 20 denoted the “one-hot”-encoded amino acid and the other 20 were the PSSM probabilities transformed by the sigmoid function. Finally, if the sequence was shorter than 700 residues, it was padded with zeros at the C- or N-terminal end (chosen at random) to match the 700 × 40 matrix size.
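The encoding can be sketched as follows. This is a minimal illustration, not the published code; the 20-letter alphabet ordering and the 50/50 choice of padding side are assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # alphabet order is an assumption
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 700

def encode(sequence: str, pssm: np.ndarray, rng=np.random) -> np.ndarray:
    """Encode a sequence and its PSSM as a 700 x 40 matrix."""
    assert len(sequence) <= MAX_LEN and pssm.shape == (len(sequence), 20)
    features = np.zeros((len(sequence), 40), dtype=np.float32)
    for i, aa in enumerate(sequence):
        features[i, AA_INDEX[aa]] = 1.0               # one-hot amino acid
    features[:, 20:] = 1.0 / (1.0 + np.exp(-pssm))    # sigmoid-transformed PSSM
    pad = np.zeros((MAX_LEN - len(sequence), 40), dtype=np.float32)
    if len(pad) and rng.random() < 0.5:
        return np.vstack([pad, features])             # pad at the N-terminal end
    return np.vstack([features, pad])                 # pad at the C-terminal end
```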

Deep-learning model architecture and training

The cascaded convolutional and recurrent network architecture, based on the DCRNN secondary structure predictor16, with minor modifications, was implemented in Keras30 using the Tensorflow31 backend (Fig. 6). Sequences and PSSMs encoded as 700 × 40 matrices (see “Sequence encoding” section above for details) were independently introduced into three 1D convolutional layers (window lengths 3, 5 and 7), each of which contains 64 filters and is activated with the tanh function. The outputs of the three convolutional layers, i.e. three 700 × 64 matrices, were passed through batch normalization layers and concatenated to yield a 700 × 192 matrix. Next, the concatenated matrix was used as an input to a fully-connected layer containing 200 neurons and activated with the ReLU function. To detect dependencies between distant residues based on the local features extracted by the convolutional layers, two bidirectional LSTM layers were used. Each consisted of 200 neurons; their dropout and recurrent dropout parameters were set to 0.5, and their activation and recurrent activation functions were set to tanh and sigmoid, respectively. Finally, a dense layer with 200 neurons and the ReLU activation function was used to connect the LSTMs with an output layer containing 4 neurons and the softmax activation function. The final output of the network is a 700 × 4 matrix indicating the residue-wise probabilities of the 4 secondary structure classes: I – π-helix, H – α-helix, E – β-strand, C – unstructured region. The implementation in Python and Keras is available at: https://github.com/labstructbioinf/PiPred.
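The architecture described above can be sketched in `tensorflow.keras` as follows. Layer sizes and activations follow the text, but this is a reconstruction, not the published implementation; refer to the GitHub repository for the authors' code.

```python
from tensorflow.keras import layers, Model

def build_model(max_len=700, n_features=40):
    """Sketch of the cascaded CNN + bidirectional-LSTM architecture."""
    inputs = layers.Input(shape=(max_len, n_features))
    convs = []
    for window in (3, 5, 7):                       # three parallel 1D convolutions
        c = layers.Conv1D(64, window, padding="same", activation="tanh")(inputs)
        convs.append(layers.BatchNormalization()(c))
    x = layers.Concatenate()(convs)                # 700 x 192
    x = layers.Dense(200, activation="relu")(x)    # per-residue dense layer
    for _ in range(2):                             # two bidirectional LSTM layers
        x = layers.Bidirectional(layers.LSTM(
            200, return_sequences=True,
            dropout=0.5, recurrent_dropout=0.5,
            activation="tanh", recurrent_activation="sigmoid"))(x)
    x = layers.Dense(200, activation="relu")(x)
    outputs = layers.Dense(4, activation="softmax")(x)  # I, H, E, C probabilities
    return Model(inputs, outputs)
```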

Figure 6

Schematic representation of the prediction model architecture. Numbers below matrices indicate their dimensionality. The input sequence and the corresponding PSSM are encoded as a 700 × 40 matrix, which is introduced into three independent convolutional layers with window lengths of 3, 5, and 7 (for clarity, only windows of lengths 3 and 5 are shown, in green and yellow, respectively). The output of each convolutional layer is normalized and subsequently all outputs are merged and passed through a dense layer, two LSTM layers, another dense layer, and an output layer. The output is a 700 × 4 matrix in which each row denotes the probabilities of E (strand), H (α-helix), I (π-helix), and C (unstructured coil) occurring at the given position of the input sequence. For details refer to the “Deep-learning model architecture and training” section of Methods.

The training process involved optimization of the network’s parameters using pairs of encoded sequences and their corresponding correct secondary structure labels. The training was performed in a 10-fold cross-validation (CV) framework: the training set was randomly divided into 10 equally-sized parts, each containing approximately the same number of π-helical residues. In each CV round, one part served as the validation set, whereas the remaining nine together served as the training set. The training was performed for 50 epochs with the ‘Adam’ optimizer32 (the learning rate was set to 0.0003, whereas the remaining parameters were set to their default values) with categorical cross-entropy as the loss function (to account for the significant underrepresentation of π-helices, the π-helix class was weighted by a factor of five). From each CV round, the best model (according to the F1-score of π-helix classification) was selected, and the resulting 10 models were used to build the final ensemble predictor (PiPred). For each residue of a given sequence and PSSM, PiPred returns four probabilities corresponding to the four secondary structure classes (I, H, E, and C). At each position of the input sequence, the predicted secondary structure is defined as the class with the highest probability.
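The ensemble step can be sketched as follows. The text does not state how the 10 models are combined, so averaging the per-model softmax outputs before taking the argmax is an assumption (a common choice for such ensembles).

```python
import numpy as np

CLASSES = ("I", "H", "E", "C")  # pi-helix, alpha-helix, strand, coil

def ensemble_predict(per_model_probs):
    """Combine per-model class probabilities into a secondary structure string.

    per_model_probs: array of shape (n_models, seq_len, 4).
    """
    mean_probs = np.mean(per_model_probs, axis=0)       # average over models
    return "".join(CLASSES[i] for i in np.argmax(mean_probs, axis=-1))
```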

Function of π-helices

The 22,510 structures of the initial dataset (positive and negative examples) were used to identify associations between the presence of π-helices and biological functions. The PDB to Gene Ontology (GO) mappings were downloaded from www.geneontology.org/gene-associations/goa_pdb.gaf33; 11,066 of the 22,510 structures were found to have at least one associated GO term. These structures were analyzed with GOATOOLS34 to identify the enrichment of GO terms in the 1,773 structures containing one or more π-helices (p-values were adjusted with the Holm method). In addition to the GO enrichment statistics, we analyzed 2,555 π-helices present in ligand-containing structures using PLIP35.

Differential geometry analyses

A differential geometry representation of protein backbones was used to evaluate and compare the geometry of (i) π-helices correctly predicted as π-helices, (ii) α-helices correctly predicted as α-helices, and (iii) α-helices mispredicted as π-helices (false positives). To this end, we used the FleXgeo36 method, which implements an approach analogous to that used in CHORAL37, ARABESQUE38, and Polyphony39. The protein backbone is represented by a piecewise cubic spline interpolation using the Cα atoms as knots, i.e. as a regular smooth curve $$\vec{r}(t)$$ parametrized by Cα residue number $$t$$. According to the Fundamental Theorem of Curves, any regular spatial curve can be fully characterized by its curvature, $$\kappa$$, and torsion, $$\tau$$, values as a function of arc length $$s$$. Given the allowable change of parameters, $$s$$ and $$t$$ are related by:

$$\frac{ds}{dt}=\Vert \frac{d\vec{r}}{dt}\Vert$$

(1)

Therefore, it is possible to calculate $$\kappa \,$$and $$\tau$$ using the following equations:

$$\kappa =\Vert \vec{\kappa }\Vert =\Vert \frac{d\vec{T}}{ds}\Vert =\frac{\Vert \dot{\vec{r}}\times \ddot{\vec{r}}\Vert }{{\Vert \dot{\vec{r}}\Vert }^{3}}$$

(2)

$$\tau =\Vert \vec{\tau }\Vert =\Vert \frac{d\vec{B}}{ds}\Vert =\frac{\Vert (\dot{\vec{r}}\times \ddot{\vec{r}})\cdot \dddot{\vec{r}}\Vert }{{\Vert \dot{\vec{r}}\times \ddot{\vec{r}}\Vert }^{2}}$$

(3)

where $$\vec{T}$$ is the tangent vector and $$\vec{B}$$ is the binormal vector of the Frenet-Serret frame of the curve. Each residue can then be represented by the $$\kappa$$ and $$\tau$$ values at its Cα position. The curvature expresses how much a given point of a curve deviates from a straight line relative to the previous point, i.e. $$\kappa =0$$ only if the point does not change the tangent vector $$\vec{T}$$ of the curve. Similarly, the torsion expresses how much a given point deviates from a plane relative to the previous point, i.e. $$\tau =0$$ only if the point does not change the binormal vector $$\vec{B}$$ of the curve. The units of both $$\kappa$$ and $$\tau$$ are Å−1.
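Equations (2) and (3) can be evaluated numerically on a densely sampled curve. The sketch below (an illustration, not FleXgeo) uses central differences for the derivatives and returns the signed torsion; the text takes its magnitude. It can be checked against an ideal circular helix (a cos t, a sin t, b t), for which κ = a/(a² + b²) and τ = b/(a² + b²).

```python
import numpy as np

def curvature_torsion(r, t):
    """Curvature and torsion of a sampled curve r(t), shape (n_points, 3)."""
    d1 = np.gradient(r, t, axis=0)    # r'   (first derivative w.r.t. t)
    d2 = np.gradient(d1, t, axis=0)   # r''
    d3 = np.gradient(d2, t, axis=0)   # r'''
    cross = np.cross(d1, d2)          # r' x r''
    cross_norm = np.linalg.norm(cross, axis=1)
    kappa = cross_norm / np.linalg.norm(d1, axis=1) ** 3     # Eq. (2)
    tau = np.einsum("ij,ij->i", cross, d3) / cross_norm ** 2  # Eq. (3), signed
    return kappa, tau
```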

Based on the per-segment classification (Fig. 2B), we defined 53 α-helices mispredicted as canonical π-helices (comprising seven or more residues), 127 π-helices correctly predicted as π-helices (only predicted π-helical segments comprising seven or more residues were considered), and 127 α-helices correctly predicted as α-helices (out of the >12,000 correctly predicted α-helices, 127 were randomly selected to ensure a balanced comparison). These helical segments were analyzed using the differential geometry procedure, and the resulting curvature and torsion values were averaged for each segment and plotted (Fig. 4A).

Correction of the CB6133, CB5926, CB513, CASP10, and CASP11 datasets

CB6133 and CB51319 are standard datasets used in the development of computational tools for protein secondary structure prediction. The CB6133 dataset is composed of training and test subsets, and therefore can be used on its own in the development process. Alternatively, a CB6133 dataset variant, CB6133_filtered, obtained by removing sequences that exhibit >25% sequence identity to sequences in the CB513 dataset, can be used for training and the CB513 dataset for validation. Recently, after the discovery of duplicates in the original CB6133 dataset, an updated version of it (CB5926) as well as an updated version of its filtered dataset (CB5926_filtered) were released.

As these datasets were constructed using older DSSP assignments, which did not annotate π-helices accurately, they show a low abundance of π-helices (Supplementary Table 2). Consequently, as most secondary structure prediction methods train their models using these datasets, they do not predict π-helices. We therefore decided to correct the annotation of π-helices within these datasets. To this end, we downloaded the CB5926, CB5926_filtered, CB6133, CB6133_filtered, and CB513 datasets from https://www.princeton.edu/~jzthree/datasets/ICML2014/ and extracted the sequences of all entries. Since the sequences did not contain the corresponding PDB identifiers, we built a sequence database corresponding to crystallographic (resolution better than 2.5 Å) and NMR structures (preference was given to X-ray structures if multiple entries were matched). Subsequently, each sequence from the five datasets was used to search our custom PDB database with BLAST, and the ideal matches, i.e. those spanning the whole query sequence range, were sorted according to their resolution. In each case, the match with the best (lowest) resolution was used to generate updated secondary structure labels with the aid of the procedure described in “Construction of the training and test sets”.
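The match-selection step can be sketched as follows. The field names are illustrative, and "ideal match" is interpreted here as a 100%-identity hit covering the entire query; both are assumptions, not details stated in the text.

```python
def best_full_length_match(query_len, hits):
    """Return the full-length, 100%-identity hit with the best resolution.

    `hits` is a list of dicts with (illustrative) keys: identity (percent),
    q_start / q_end (1-based query alignment range), and resolution (in A).
    """
    full = [h for h in hits
            if h["identity"] == 100.0
            and h["q_start"] == 1 and h["q_end"] == query_len]
    # Lowest numerical resolution corresponds to the best structure quality.
    return min(full, key=lambda h: h["resolution"]) if full else None
```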

To investigate the effects of these corrections, we re-trained two state-of-the-art secondary structure prediction methods, CNNH_PSS17 and DCRNN16. Both were trained and tested using the original CB5926_filtered and CB513 datasets, respectively, and their updated variants with corrected secondary structure labels. For testing, in addition to the CB513 dataset, we also used the test set developed for the benchmarking of PiPred as well as the CASP10 and CASP11 datasets. Testing for the significance of the difference between the performance of PiPred and the other methods was performed as in ref.40 using a paired t-test; F1 scores were calculated for the individual sequences containing π-helices.
