Improving taxonomic classification with feature space balancing.

Researchers

Journal

Modalities

Models

Abstract

Modern high-throughput sequencing technologies, such as metagenomic sequencing, generate millions of sequences that need to be assigned to their taxonomic rank. Modern approaches either apply local alignment against existing databases, as in MMseqs2, or use deep neural networks, as in DeepMicrobes and BERTax. Due to the increasing size of datasets and databases, alignment-based approaches are expensive in terms of runtime. Deep learning-based approaches can require specialized hardware and consume large amounts of energy. In this article, we propose to use k-mer profiles of DNA sequences as features for taxonomic classification. Although k-mer profiles have been used before, we significantly increased their predictive power by applying a feature space balancing approach to the training data. This greatly improved the generalization quality of the classifiers. We have implemented different pipelines using our proposed feature extraction and dataset balancing in combination with different simple classifiers, such as bagged decision trees or feature subspace KNNs. By comparing the performance of our pipelines with state-of-the-art algorithms, such as BERTax and MMseqs2, on two different datasets, we show that our pipelines outperform these in almost all classification tasks. In particular, sequences from organisms that were not part of the training data were classified with high precision.

The open-source code and the code to reproduce the results are available in Seafile, at https://tinyurl.com/ysk47fmr.

Supplementary data are available at Bioinformatics Advances online.

© The Author(s) 2023. Published by Oxford University Press.
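To make the notion of a k-mer profile concrete, the following is a minimal Python sketch of one common way to compute it: count every overlapping k-mer in a sequence and normalize the counts to relative frequencies. This is not the authors' implementation. The function name `kmer_profile`, the default `k=3`, the normalization to frequencies, and the choice to skip k-mers containing ambiguous bases such as `N` are all assumptions made for illustration. The feature space balancing step is not shown, because the abstract does not describe how it works.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous DNA bases are counted


def kmer_profile(sequence, k=3):
    """Return relative k-mer frequencies in lexicographic k-mer order.

    Hypothetical helper for illustration; k and the normalization are
    assumptions, not values taken from the article.
    """
    # Enumerate all 4**k possible k-mers so every profile has the same length.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    seq = sequence.upper()
    # Slide a window of width k over the sequence, one base at a time.
    for start in range(len(seq) - k + 1):
        i = index.get(seq[start:start + k])
        if i is not None:  # windows containing N or other symbols are skipped
            counts[i] += 1

    total = sum(counts)
    if total == 0:
        # Sequence shorter than k, or no valid k-mer: return an all-zero profile.
        return [0.0] * len(kmers)
    return [c / total for c in counts]


profile = kmer_profile("ACGTACGTNACG", k=2)
```

Each sequence then maps to a fixed-length vector of 4^k values, which is the kind of input that simple classifiers such as bagged decision trees or KNNs can consume directly.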


