DIABETOLOGY / RESEARCH PAPER
Machine learning predicts diabetes risk in high-risk populations: based on the National Health and Nutrition Examination Survey database
The Second Affiliated Hospital of Zhejiang University School of Medicine, China
Submission date: 2025-06-11
Final revision date: 2025-07-30
Acceptance date: 2025-08-14
Online publication date: 2025-09-08
Corresponding author
Ting Sun
The Second Affiliated Hospital of Zhejiang University School of Medicine, China
ABSTRACT
Introduction:
This study aimed to develop and validate a machine learning-based diabetes prediction model for high-risk populations.
Material and methods:
A total of 2,355 samples from three cycles (2013 to 2018) of the National Health and Nutrition Examination Survey (NHANES) database were included. The data were split into training and test sets at a 7:3 ratio. Nineteen risk predictors were selected as feature variables, covering demographic baseline data, measurement data, medical history, and psychological health. Five machine learning models, including decision tree, random forest (RF), multilayer perceptron (MLP), AdaBoost, and XGBoost,
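As an illustration of the 7:3 split described above, the following is a minimal sketch in plain Python. It assumes the split was stratified by outcome, which the abstract does not state; the function name `stratified_split`, the seed, and the label encoding (1 = diabetes, 0 = no diabetes) are hypothetical and chosen only for this example. The class counts are the ones reported in the Results.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split sample indices into train/test sets, class by class.

    Assumption: stratification by outcome. The abstract only says the
    data were divided 7:3; it does not say how.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take roughly test_fraction of each class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class counts reported in the abstract: 260 with diabetes, 2,095 without.
labels = [1] * 260 + [0] * 2095
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Stratifying keeps the share of diabetes cases (about 11%) similar in both sets, which matters when the positive class is this small. A plain random split would be an equally valid reading of the text.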
Results:
The analysis included 2,355 individuals at high risk of diabetes: 260 with diabetes and 2,095 without. Of the five machine learning models, RF and XGBoost outperformed the other models overall. In the test set, the RF model had an AUC of 0.896, accuracy of 0.784, sensitivity of 0.739, specificity of 0.849, and Matthews correlation coefficient (MCC) of 0.418. The XGBoost model had an AUC of 0.903, accuracy of 0.815, sensitivity of 0.962, and MCC of 0.443. According to the feature importance analysis of these two best-performing models, waist circumference, age, BMI, gender
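For readers unfamiliar with the metrics reported above, the sketch below shows how accuracy, sensitivity, specificity, MCC, and AUC are conventionally computed. The confusion-matrix counts and scores are made-up toy values, not data from the study, and the function names `confusion_metrics` and `auc` are hypothetical.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        # MCC is defined as 0 when any marginal total is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def auc(scores_pos, scores_neg):
    """AUC as the probability that a random positive case outscores a
    random negative case (ties count as half). Quadratic time, which is
    fine for illustration."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


# Toy counts for illustration only (not from the study).
metrics = confusion_metrics(tp=8, fp=10, tn=80, fn=2)
toy_auc = auc(scores_pos=[0.9, 0.8], scores_neg=[0.1, 0.85])
print(metrics, toy_auc)
```

Because only about 11% of the cohort has diabetes, accuracy alone can look high for a model that rarely predicts the positive class. MCC draws on all four cells of the confusion matrix and is less flattering in that situation.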
Conclusions:
The RF and XGBoost models performed well in predicting the occurrence of diabetes in high-risk populations. Such models can aid in developing more precise interventions and personalized treatment plans to reduce the incidence of diabetes and related risks in this population.