Machine learning predicts diabetes risk in high-risk populations: analysis of National Health and Nutrition Examination Survey data

Physical Examination Center, The Second Affiliated Hospital of Zhejiang University School of Medicine, 88 Jiefang Road, Shangcheng District, Hangzhou, 310009, China. Phone: 13906503158

DOI: https://doi.org/10.5114/aoms/209547

KEYWORDS

diabetes

machine learning

National Health and Nutrition Examination Survey

prediction model

TOPICS

Diabetology

ABSTRACT

Introduction:
This study aimed to develop and validate a machine learning-based diabetes prediction model for high-risk populations.

Material and methods:
A total of 2,355 samples from three cycles (2013 to 2018) of the National Health and Nutrition Examination Survey (NHANES) database were included. The data were divided into training and testing sets in a 7:3 ratio. Nineteen risk prediction factors were selected as feature variables, covering demographic baseline data, measurement data, medical history, and psychological health. Five machine learning models (decision tree, random forest (RF), multilayer perceptron (MLP), AdaBoost, and Extreme Gradient Boosting (XGBoost)) were developed from these data and variables. Model performance was evaluated using accuracy, sensitivity, specificity, the area under the receiver operating characteristic (ROC) curve (AUC), and the Matthews correlation coefficient (MCC). Finally, a Shapley value-based feature importance tool was used to identify the key features in the optimal model.
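The threshold-based metrics named above (accuracy, sensitivity, specificity, and MCC) can all be derived from a binary confusion matrix. The following is a minimal illustrative sketch in Python, not the authors' code; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, sensitivity, specificity, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Sensitivity: share of true diabetes cases that the model flags.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of non-diabetes cases that the model clears.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # MCC uses all four cells; by convention it is set to 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (not data from the study).
m = binary_metrics(tp=60, fp=100, tn=520, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

AUC is not included in this sketch because it is computed from predicted probabilities across all thresholds rather than from a single confusion matrix.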

Results:
The final analysis included 2,355 individuals at high risk of diabetes, of whom 260 had diabetes and 2,095 did not. Among the five machine learning models established in this study, the RF and XGBoost models showed better overall performance than the other models. In the test set, the RF model had an AUC of 0.896, accuracy of 0.784, sensitivity of 0.739, specificity of 0.849, and MCC of 0.418. The XGBoost model had an AUC of 0.903, accuracy of 0.815, sensitivity of 0.962, and MCC of 0.443. According to the feature importance analysis of these two optimal models, waist circumference, age, BMI, gender, systolic blood pressure (SBP), diastolic blood pressure (DBP), education level, poverty income ratio (PIR), Patient Health Questionnaire (PHQ)-9 score, and race were the top ten key risk factors for diabetes in the high-risk population.
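The case counts reported above imply a markedly imbalanced sample. The short sketch below works through that arithmetic for the whole sample; the variable names are illustrative, and the class distribution within the test set is an unknown here because the abstract does not report it.

```python
# Counts reported in the abstract for the whole sample.
n_total = 2355
n_diabetes = 260
n_no_diabetes = 2095

# Sanity check: the two groups add up to the full sample.
assert n_diabetes + n_no_diabetes == n_total

# Share of the sample with diabetes (the positive class).
prevalence = n_diabetes / n_total

# Accuracy of a trivial rule that always predicts "no diabetes",
# evaluated on the whole sample (not on the unreported test split).
majority_baseline_accuracy = n_no_diabetes / n_total

print(f"prevalence: {prevalence:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```

With roughly one diabetes case in nine individuals, a rule that never predicts diabetes would be correct for most of the sample while detecting no cases, which is one reason to read accuracy alongside sensitivity, specificity, and MCC.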

Conclusions:
The RF and XGBoost machine learning models demonstrated strong performance in predicting the occurrence of diabetes in high-risk populations. These models can aid in developing more precise intervention measures and personalized treatment plans to effectively reduce the incidence of diabetes and related risks in this population.

eISSN:	1896-9151
ISSN:	1734-1922