Extremely missing numerical data in Electronic Health Records for machine learning can be managed through simple imputation methods considering informative missingness: A comparative of solutions in a COVID-19 mortality case study.

Researchers

Agustín Gómez de la Cámara, Carlos Sáez, David Lora-Pablos, Juan M García-Gómez, Nekane Romero-Garcia, Pablo Ferri, Rafael Badenes, Teresa García Morales

Journal

Computer methods and programs in biomedicine

Models

Deep Multilayer Perceptron, Gradient Boosting, K-Nearest Neighbors, Logistic Regression, Random Forest

Abstract

Reusing Electronic Health Records (EHRs) for Machine Learning (ML) often leads to extremely incomplete and sparse tabular datasets, which can hinder model development and limit model performance and generalization. In this study, we aimed to characterize the most effective data imputation techniques and ML models for dealing with highly missing numerical data in EHRs, in the case where only a very limited number of data are complete, as opposed to the usual case of having a reduced number of missing values.

We used a case study including full blood count laboratory data, demographic and survival data in the context of COVID-19 hospital admissions and evaluated 30 processing pipelines combining imputation methods with ML classifiers. The imputation methods included missing mask, translation and encoding, mean imputation, k-nearest neighbors imputation, Bayesian ridge regression imputation and generative adversarial imputation networks. The classifiers included k-nearest neighbors, logistic regression, random forest, gradient boosting and deep multilayer perceptron.

Our results suggest that in the presence of highly missing data, combining translation and encoding imputation (which considers informative missingness) with tree ensemble classifiers (random forest and gradient boosting) is a sensible choice when aiming to maximize performance, in terms of area under the curve.

Based on our findings, we recommend the consideration of this imputer-classifier configuration when constructing models in the presence of extremely incomplete numerical data in EHRs.

Copyright © 2023. Published by Elsevier B.V.
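The abstract names the imputation methods but does not define them, so the following is only an illustrative sketch of how two of the simpler ones might look. It assumes that "translation and encoding" means shifting the observed values of a feature so they all lie above a reserved code and then writing that code in place of missing entries, and that "missing mask" means a binary indicator of which entries were absent. The function names, the `offset` and `missing_code` parameters, and the hemoglobin values are hypothetical and are not taken from the paper.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing entries."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_shift(train_column, offset=1.0):
    """Compute the shift that moves the smallest observed training value to `offset`.

    Assumption: the shift is learned on training data only and reused on
    validation and test data, so no information leaks across splits.
    """
    observed = [v for v in train_column if not is_missing(v)]
    return offset - min(observed) if observed else 0.0


def translate_and_encode(column, shift, missing_code=0.0):
    """Shift observed values and write `missing_code` where a value is missing.

    With offset=1.0 and missing_code=0.0, the code sits below every observed
    training value, so a tree can isolate "missing" with a single split.
    """
    return [missing_code if is_missing(v) else v + shift for v in column]


def missing_mask(column):
    """Binary indicator column: 1 where the value is missing, 0 otherwise."""
    return [1 if is_missing(v) else 0 for v in column]


# Hypothetical hemoglobin readings with two missing entries.
hemoglobin = [13.5, None, 9.0, float("nan"), 15.5]

shift = fit_shift(hemoglobin)                     # 1.0 - 9.0 = -8.0
encoded = translate_and_encode(hemoglobin, shift)
mask = missing_mask(hemoglobin)

print(encoded)  # [5.5, 0.0, 1.0, 0.0, 7.5]
print(mask)     # [0, 1, 0, 1, 0]
```

Under these assumptions, the encoded column keeps the fact that a value was absent as a distinct, separable value rather than replacing it with a plausible-looking estimate. This is one way to read "considers informative missingness", and it may explain why such an encoding pairs well with tree ensembles.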
