Empirical Study of Protein Feature Representation on Deep Belief Networks Trained with Small Data for Secondary Structure Prediction.
Abstract
Protein secondary structure (SS) prediction is a classic problem in computational biology and is widely used for structural characterization and homology inference. While most SS predictors are trained on thousands of sequences, a previous approach developed a compact model from a small set of training proteins, using an energy-based feature representation derived from the <b>C</b>-<b>A</b>lpha, C-<b>B</b>eta <b>S</b>ide Chain (<b>CABS</b>) algorithm. Here, that approach is extended to Deep Belief Networks (DBN). Deep learning methods are notorious for requiring large datasets, and there is wide consensus that training deep models from scratch on small datasets works poorly. By contrast, we demonstrate a simple DBN architecture containing a single hidden layer, trained only on the CB513 dataset. When tested on an independent set of G Switch proteins, it improved on the Q<sub>3</sub> score of the previous compact model by almost 3%. The findings are further confirmed by comparison with several deep learning models trained on thousands of proteins. Finally, the DBN performance is compared with that obtained using a <b>P</b>osition <b>S</b>pecific <b>S</b>coring <b>M</b>atrix (<b>PSSM</b>) profile-based feature representation. The results demonstrate the importance of (i) structural information in protein feature representation and (ii) complementary small-dataset learning approaches for detecting structural fold switching.
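For readers unfamiliar with the metric, Q<sub>3</sub> is conventionally the percentage of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. The following is a minimal sketch of that calculation, not the authors' evaluation code; the function name `q3_score` and the two example strings are hypothetical and chosen only for illustration.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose three-state label matches.

    Both arguments are per-residue label strings over the alphabet
    H (helix), E (strand), C (coil). Illustrative helper, not taken
    from the paper.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count positions where the predicted label equals the observed label.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 10-residue example: 8 of 10 labels agree.
observed = "CHHHHEEECC"
predicted = "CHHHCEEEEC"
print(q3_score(predicted, observed))  # 80.0
```

On this scale, a difference of a few points corresponds to a few additional correctly labeled residues per hundred.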