Interpretable feature extraction and dimensionality reduction in ESM2 for protein localization prediction.

January 27, 2024 Bioinformatics, Computational Biology

Researchers

Junhao Liu, Rui Wang, Yawen Sun, Yu-Juan Zhang, Zeyu Luo, Zongqing Chen

Journal

Briefings in Bioinformatics

Modalities

Models

Deep Neural Networks, Random Forest

Abstract

As the application of large language models (LLMs) has broadened into biological prediction, leveraging their capacity for self-supervised learning to create feature representations of amino acid sequences, these models have set a new benchmark for downstream tasks such as subcellular localization. However, previous studies have focused primarily on either model architecture or fine-tuning strategies, largely overlooking the nature of the features derived from LLMs. In this research, we propose different ESM2 representation extraction strategies, considering both the character type and the position within the ESM2 input sequence. Using dimensionality reduction, predictive analysis and interpretability techniques, we illuminate potential associations between diverse feature types and specific subcellular localizations. In particular, prediction of mitochondrion and Golgi apparatus localization favors segment features closer to the N-terminus, and phosphorylation site-based features could mirror phosphorylation properties. We also evaluate the prediction performance and interpretability robustness of Random Forest and Deep Neural Networks with varied feature inputs. This work offers novel insights into maximizing LLMs’ utility, understanding their mechanisms, and extracting biological domain knowledge. Furthermore, we have made the code, feature extraction API, and all relevant materials available at https://github.com/yujuan-zhang/feature-representation-for-LLMs.

Published by Oxford University Press 2024.
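The idea of building features from different positions and character types can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-dimensional embeddings, and the choice of S/T/Y as candidate phosphorylation residues are all assumptions made for illustration, and a real pipeline would take per-residue vectors from an ESM2 model rather than hard-coded numbers.

```python
from statistics import fmean


def mean_pool(rows):
    """Average a list of equal-length vectors column by column."""
    if not rows:
        raise ValueError("no rows to pool")
    return [fmean(col) for col in zip(*rows)]


def segment_feature(embeddings, start, end):
    """Pool residues in the half-open window [start, end),
    for example a segment near the N-terminus."""
    return mean_pool(embeddings[start:end])


def residue_type_feature(sequence, embeddings, residue_types):
    """Pool only the positions whose amino acid is in residue_types."""
    rows = [vec for aa, vec in zip(sequence, embeddings) if aa in residue_types]
    return mean_pool(rows)


# Hypothetical stand-in for per-residue embeddings (one vector per amino acid).
# Real ESM2 vectors are much wider; 2 dimensions keep the example readable.
sequence = "MSTAY"
embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]

whole = mean_pool(embeddings)                 # whole-sequence feature
n_term = segment_feature(embeddings, 0, 2)    # first two residues only
sty = residue_type_feature(sequence, embeddings, "STY")  # S, T and Y positions

print(whole, n_term, sty)
```

Each function returns one fixed-length vector per protein, so the outputs could be fed to a classifier such as a Random Forest and compared across extraction strategies.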

Similar Posts

DTITR: End-to-end drug-target binding affinity prediction with transformers.

An introduction to deep learning on biological sequence data: examples and solutions.

ProtInteract: A deep learning framework for predicting protein-protein interactions.

Deep-Learning Algorithm and Concomitant Biomarker Identification for NSCLC Prediction Using Multi-Omics Data Integration.

The Effect of Resampling on Data-Imbalanced Conditions for Prediction towards Nuclear Receptor Profiling Using Deep Learning.

A comprehensive analysis of recent advancements in cancer detection using machine learning and deep learning models for improved diagnostics.
