|

A protein pre-trained model-based approach for the identification of the liquid-liquid phase separation (LLPS) proteins.

Researchers

Journal

Modalities

Models

Abstract

Liquid-liquid phase separation (LLPS) regulates many biological processes including RNA metabolism, chromatin rearrangement, and signal transduction. Aberrant LLPS potentially leads to serious diseases. Therefore, the identification of the LLPS proteins is crucial. Traditionally, biochemistry-based methods for identifying LLPS proteins are costly, time-consuming, and laborious. In contrast, artificial intelligence-based approaches are fast and cost-effective and can be a better alternative to biochemistry-based methods. Previous research methods employed word2vec in conjunction with machine learning or deep learning algorithms. Although word2vec captures word semantics and relationships, it might not be effective in capturing features relevant to protein classification, like physicochemical properties, evolutionary relationships, or structural features. Additionally, other studies often focused on a limited set of features for model training, including planar π contact frequency, pi-pi, and β-pairing propensities. To overcome such shortcomings, this study first constructed a reliable dataset containing 1206 protein sequences, including 603 LLPS and 603 non-LLPS protein sequences. Then a computational model was proposed to efficiently identify the LLPS proteins by perceiving semantic information of protein sequences directly; using an ESM2-36 pre-trained model based on transformer architecture in conjunction with a convolutional neural network. The model could achieve an accuracy of 85.86 % and 89.26 %, respectively on training data and test data, surpassing the accuracy of previous studies. The performance demonstrates the potential of our computational methods as efficient alternatives for identifying LLPS proteins.Copyright © 2024. Published by Elsevier B.V.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *