DIABETOLOGY / BASIC RESEARCH
Machine learning predicts diabetes risk in high-risk populations: analysis of National Health and Nutrition Examination Survey data
1 Nursing Department, The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China
2 Physical Examination Center, The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China
Submission date: 2025-06-11
Final revision date: 2025-07-30
Acceptance date: 2025-08-14
Online publication date: 2025-09-08
Corresponding author
Ting Sun
Physical Examination Center, The Second Affiliated Hospital of Zhejiang University School of Medicine
88 Jiefang Road, Shangcheng District
Hangzhou, 310009, China
Phone: 13906503158
ABSTRACT
Introduction:
This study aimed to develop and validate a machine learning-based diabetes prediction model for high-risk populations.
Material and methods:
A total of 2,355 participants from three cycles (2013 to 2018) of the National Health and Nutrition Examination Survey (NHANES) database were included. The data were split into training and testing sets in a 7:3 ratio. Nineteen candidate risk predictors were selected as feature variables, covering demographic baseline data, measurement data, medical history, and psychological health. Five machine learning models were developed from these data and variables: decision tree, random forest (RF), multilayer perceptron (MLP), AdaBoost, and Extreme Gradient Boosting (XGBoost). Model performance was evaluated using accuracy, sensitivity, specificity, the area under the receiver operating characteristic (ROC) curve (AUC), and the Matthews correlation coefficient (MCC). Finally, Shapley value-based feature importance was used to assess the contribution of each feature in the best-performing models.
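The five evaluation metrics named above can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made purely for illustration, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case) rather than by tracing the ROC curve.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 = diabetes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # MCC is undefined when any marginal is zero; 0.0 is the usual convention.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from NHANES).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = classification_metrics(*confusion_counts(y_true, y_pred))
metrics["auc"] = roc_auc(y_true, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy, sensitivity, specificity, and MCC depend on the chosen decision threshold, whereas the AUC depends only on how the scores rank the cases. This is one reason the same model can show a high AUC alongside a more modest MCC.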
Results:
The analysis included 2,355 individuals at high risk of diabetes, of whom 260 had diabetes and 2,095 did not. Among the five machine learning models, the RF and XGBoost models showed the best overall performance. In the test set, the RF model had an AUC of 0.896, accuracy of 0.784, sensitivity of 0.739, specificity of 0.849, and MCC of 0.418. The XGBoost model had an AUC of 0.903, accuracy of 0.815, sensitivity of 0.962, and MCC of 0.443. According to the feature importance analysis of these two models, waist circumference, age, BMI, gender, systolic blood pressure (SBP), diastolic blood pressure (DBP), education level, poverty income ratio (PIR), Patient Health Questionnaire (PHQ)-9 score, and race were the top ten risk factors for diabetes in the high-risk population.
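The cohort counts reported in the Results (260 with diabetes, 2,095 without) indicate a marked class imbalance, which helps explain why MCC is reported alongside accuracy. The following sketch works through a hypothetical majority-class baseline that labels every participant as non-diabetic. The baseline is an assumption introduced for illustration only. It is computed on the full cohort, whereas the reported metrics come from the test set, whose class balance is not stated in the abstract, so the figures are not directly comparable.

```python
from math import sqrt

# Cohort counts as reported in the Results paragraph.
n_diabetes = 260
n_no_diabetes = 2095
n_total = n_diabetes + n_no_diabetes

prevalence = n_diabetes / n_total

# Hypothetical baseline: predict "no diabetes" for every participant.
tp, fn = 0, n_diabetes
tn, fp = n_no_diabetes, 0

baseline_accuracy = (tp + tn) / n_total
baseline_sensitivity = tp / (tp + fn)
baseline_specificity = tn / (tn + fp)

# The MCC denominator is zero here because no case is predicted positive;
# the usual convention is to report MCC as 0 in that situation.
denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
baseline_mcc = (tp * tn - fp * fn) / denom if denom else 0.0

print(f"prevalence:  {prevalence:.3f}")
print(f"accuracy:    {baseline_accuracy:.3f}")
print(f"sensitivity: {baseline_sensitivity:.3f}")
print(f"specificity: {baseline_specificity:.3f}")
print(f"MCC:         {baseline_mcc:.3f}")
```

Under this assumption the baseline reaches an accuracy of about 0.89 while detecting no diabetes cases (sensitivity 0, MCC 0). In an imbalanced cohort, accuracy alone is therefore a weak indicator of discriminative performance.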
Conclusions:
The RF and XGBoost machine learning models demonstrated strong performance in predicting the occurrence of diabetes in high-risk populations. These models may support more precise intervention measures and personalized treatment plans aimed at reducing the incidence of diabetes and related risks in this population.