Predicting dual primary tumors in patients diagnosed with first-episode breast cancer (BC) is crucial because it can inform physicians' treatment decisions. We applied eight machine learning algorithms to BC data from the Surveillance, Epidemiology, and End Results (SEER) program database and identified the best model for predicting dual primary BC, to help physicians assess patient prognoses.

Material and methods:
Machine learning models were built in a retrospective study of 253,991 patients diagnosed with first-episode BC in the SEER database from 2010 to 2015. External validation was conducted on 6,012 cases obtained by undersampling SEER records from 2004 to 2009. The decision tree (DT) and random forest (RF) models were tuned using ten-fold cross-validation and grid search.
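The undersampling and ten-fold splitting steps can be illustrated with a minimal sketch. It assumes simple random undersampling to a 1:1 class ratio and shuffled (non-stratified) folds, neither of which is specified in the text; the function names `undersample` and `k_fold_indices`, the label encoding (1 = dual primary BC, 0 = otherwise), and the toy records are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict


def undersample(records, labels, seed=0):
    """Randomly undersample every class to the size of the smallest class.

    Assumption: balanced 1:1 random undersampling; the study's exact
    procedure is not described in the text.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_min))
    kept.sort()  # keep the original record order
    return [records[i] for i in kept], [labels[i] for i in kept]


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k shuffled folds over n samples."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# Toy data (hypothetical): 95 negatives and 5 positives.
toy_labels = [0] * 95 + [1] * 5
toy_records = [{"id": i} for i in range(100)]

X_bal, y_bal = undersample(toy_records, toy_labels)
splits = list(k_fold_indices(len(y_bal), k=5))
print(len(y_bal), sum(y_bal), len(splits))
```

In a grid search, each candidate hyperparameter setting would be scored on every train/test split and the setting with the best mean score retained; in practice, a library such as scikit-learn provides this through `GridSearchCV`.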

Surgical information, lymph-node status, distant metastasis, tumor size, survival time, and histological type had a significant influence as input features. Compared with the other seven models (multinomial naïve Bayes, logistic regression, k-nearest neighbor, one-dimensional convolutional neural network, recurrent neural network, long short-term memory, and DT), the accuracy of the RF model increased from 63.25% to 97.19%, whereas its precision, recall, F1 score, and area under the curve (AUC) increased from 62.92% to 95.01%, 64.36% to 99.48%, 63.63% to 97.19%, and 63.25% to 97.10%, respectively. RF was the only model whose AUC increased (by 0.24%) under external validation, indicating excellent portability and generalization in the validation cohort.
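For reference, the five reported metrics can be computed from binary labels and model scores as in the sketch below. It uses the standard definitions only; the function names `binary_metrics` and `auc`, and the toy labels and scores, are hypothetical and unrelated to the SEER data.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 score for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This pairwise form is quadratic in the sample
    size and is meant for illustration only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical labels, predictions, and scores).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

metrics = binary_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics, auc_value)
```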

The RF model can be used to predict dual primary BC and assist physicians with the diagnosis and treatment of first-episode BC patients.