
Deep Learning-Based Imbalanced Data Classification for Drug Discovery.

Abstract

Drug discovery has become an increasingly expensive and time-consuming process. In the early phase of a drug discovery study, an extensive search is performed to find drug-like compounds, which can then be optimized over time to become a marketed drug. One conventional way of detecting active compounds is to perform a high-throughput screening (HTS) experiment. As of July 2019, the PubChem repository contained 1.3 million bioassays generated through HTS experiments. This makes PubChem a valuable resource for applying machine learning algorithms to build classification models that detect active compounds for drug discovery. However, datasets obtained from PubChem are highly imbalanced, and this imbalance degrades the classification performance of machine learning algorithms. Here, we explored the classification performance of deep neural networks (DNNs) on imbalanced compound datasets after applying various data balancing methods. We used five confirmatory HTS bioassays from the PubChem repository and applied one under-sampling and three over-sampling methods to balance the data. We used a fully connected DNN with two hidden layers to classify molecules as active or inactive. To evaluate the performance of the network, we calculated six performance metrics: balanced accuracy, precision, recall, F1 score, Matthews correlation coefficient, and area under the ROC curve. The results showed that the effect of imbalanced data on network performance could be mitigated to a degree by applying the data balancing methods. The degree of imbalance, however, had a negative effect on the performance of the network.
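
The abstract names six evaluation metrics but does not give their formulas. The sketch below computes them from their standard textbook definitions in plain Python, to show how they behave on an imbalanced active/inactive split. The function names, the toy labels and scores, and the 0.5 decision threshold are assumptions made for illustration; none of them come from the study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels: 1 = active, 0 = inactive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking: the fraction of
    (active, inactive) pairs where the active scores higher; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")  # undefined when one class is absent
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true, scores, threshold=0.5):
    """Compute the six metrics listed in the abstract (threshold is assumed)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)          # true positive rate
    specificity = safe_div(tn, tn + fp)     # true negative rate
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
        "roc_auc": roc_auc(y_true, scores),
    }


# Toy data (hypothetical): 2 actives among 10 compounds.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.2, 0.1, 0.3, 0.2]

for name, value in evaluate(y_true, scores).items():
    print(f"{name}: {value:.4f}")
```

On this toy set, plain accuracy would be 0.8 because most inactives are classified correctly, while balanced accuracy is 0.6875 since only one of the two actives is found. That gap is why imbalance-aware metrics are preferred for this kind of data.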
