Recent Publication by Sanders-Brown Researchers Looks at Issue of Data Redundancy in Machine Learning

LEXINGTON, Ky. (Dec. 1, 2021) — Work by a group of researchers at the University of Kentucky’s Sanders-Brown Center on Aging was recently published in Genes. The article looks at the use of data mining and machine learning in research.

The Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset contains extensive patient measurements, including magnetic resonance imaging (MRI), biometrics, and RNA expression, from Alzheimer’s disease cases and controls. Machine learning algorithms have recently used these data to evaluate Alzheimer’s disease onset and progression. While using a variety of biomarkers is essential to Alzheimer’s disease research, highly correlated input features can significantly decrease the generalizability and performance of machine learning models. Redundant features also unnecessarily increase the computational time and resources needed to train predictive models.

Justin Miller, Ph.D., assistant professor in the UK College of Medicine, directed this work through a collaboration with Mark Ebbert, Ph.D., assistant professor in the UK College of Medicine, and staff scientists Erik Huckvale and Matthew Hodgman. Together, they used 49,288 biomarkers and 793,600 extracted MRI features to assess feature correlation within the ADNI dataset and to determine how much this issue might affect large-scale analyses using these data. Miller says the team found that more than 90% of the biomarkers, gene expression data, and MRI data included in the ADNI dataset are very highly correlated with at least one other data type, which could pose unforeseen challenges in using machine learning to identify patterns across the diverse data available in that dataset.

In this publication, Miller and his colleagues provide mappings of the highly correlated features so that future studies can consider this feature correlation and improve machine learning accuracy and efficiency in Alzheimer’s disease research.
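
For readers unfamiliar with the idea, the short Python sketch below illustrates what a mapping of highly correlated features can look like and how it might be used to set aside redundant inputs before training a model. It is only an illustration, not the method used in the publication: the feature names, the toy values, the 0.9 cutoff, and the helper functions (pearson, correlated_feature_map, select_non_redundant) are all hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant feature: correlation is undefined, treat as none
    return cov / math.sqrt(var_x * var_y)


def correlated_feature_map(features, threshold=0.9):
    """Map each feature name to the other features it is highly correlated with.

    `threshold` is an assumed cutoff on the absolute correlation,
    chosen here for illustration only.
    """
    mapping = {name: [] for name in features}
    for a, b in combinations(features, 2):
        if abs(pearson(features[a], features[b])) >= threshold:
            mapping[a].append(b)
            mapping[b].append(a)
    return mapping


def select_non_redundant(features, threshold=0.9):
    """Greedily keep a feature only if no already-kept feature is a correlated partner."""
    mapping = correlated_feature_map(features, threshold)
    kept = []
    for name in features:
        if not any(partner in kept for partner in mapping[name]):
            kept.append(name)
    return kept


# Hypothetical toy data: one list of measurements per feature, one value per patient.
toy = {
    "mri_volume_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mri_volume_b": [2.1, 4.0, 6.2, 7.9, 10.1],  # roughly 2 * mri_volume_a
    "gene_x": [5.0, 1.0, 4.0, 2.0, 3.0],
}

mapping = correlated_feature_map(toy)
kept = select_non_redundant(toy)
print(mapping)
print(kept)
```

In this toy case the two MRI features track each other almost exactly, so only one of them is kept alongside the unrelated gene expression feature. Comparing every pair of features this way becomes expensive at the scale of the ADNI dataset, which is one reason a published mapping can save future studies time.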

“Feature correlation has always been an issue in large datasets, but it was previously unknown the extent to which this issue permeated the Alzheimer’s Disease Neuroimaging dataset,” said Miller. “This research will help improve data mining accuracy and efficiency in the ADNI dataset. Machine learning is a promising avenue of research to identify patterns that can one day improve patient care. This research lays the groundwork for those future analyses.”