Julia Huang and Justin Huang, Northville High School, Northville, Michigan, USA
Abstract
According to the National Cancer Institute, 39.5% of individuals globally will be diagnosed with cancer. Among all cancers, breast, kidney, liver, lung, and thyroid are the most common, with hundreds of thousands of cases annually. As cancer is both a common and life-threatening disease worldwide, scientists pose a question of how best to predict, detect, and diagnose cancers. Recently, microRNAs (or tiny noncoding miRNAs) have been discovered to be critical biomarkers for detecting cancers, as they result in different expressions in normal and cancerous tissues. Biomarkers are levels of gene expressions or molecules that indicate a normal process or a disease. Because a certain set of these miRNAs have been linked to specific cancers, we developed, trained, and optimized two different robust deep-learning algorithms, Random Forest and XGBoost, to perform miRNA feature selection by first including significant miRNAs then classifying between tumor and control types on miRNA datasets for quick and convenient identification of cancers. We performed studies on six different cancer types and experimented with both complex and simple versions of machine learning models, which are computer-programmed algorithms that utilize artificial intelligence to automatically learn patterns and information from input data in order to solve a particular task, similar to how humans solve problems. Experimental results demonstrate a 95.536% state-of-the-art average classification accuracy across all seven datasets for both models, proving both the miRNA’s useful indication of cancer and the promise of deep-learning approaches for future research into robust and generalized prediction mechanisms of cancer. Our process utilizing the R language, including datasets, models, and results, are available at: https://github.com/TheClassicTechno/miRNACancerClassification.
Introduction
Our paper discusses machine learning methods to aid in the diagnosis of various cancers, a disease that affects humans globally. According to the National Cancer Institute, approximately 39.5% of individuals will be diagnosed with cancer at some point ("National Cancer Institute," n.d.). Cancer is a type of disease where cells in living beings multiply rapidly and uncontrollably and spread to other cells. As cancer is both a common and life-threatening disease, there poses a question of how best to predict, detect, and diagnose cancers.
Currently, one of the most prevalent treatments for cancer is surgery on the body part that contains cancer: this way, most of the cancerous cells can be cut off from the rest of the body and prevent them from turning healthy cells cancerous. Other common treatments include radiation therapy (or radiotherapy), where beams of energy are used to damage DNA of cancer cells, and chemotherapy, where chemicals are used to kill cancerous cells (“Mayo Clinic”, 2020; “Mayo Clinic”, 2022). However, these treatments can be invasive and expensive, and some can also cause side effects. For example, surgery may only cut out the most infected part(s) of a patient’s body and miss parts that have a bit of cancerous cells in it, so later this patch may grow larger again and therefore the patient would need to schedule another surgery session. In addition, both radiotherapy and chemotherapy can cause hair loss, fatigue, and changes in memory. Therefore, these cons warrant another, more suitable method to be found.
In the last decade, microRNAs, or single-stranded RNA molecules that control the expression of genes of organisms, have been discovered to be reliable indicators of cancer due to their differing gene expressions in tumor versus normal cells and altered profiles in each type of cancer (Di Leva et al., 2013; Bartel, 2004; Mendell & Olson, 2012). In addition, some miRNAs suppress tumors, some act as oncogenes that transform a normal cell into a cancerous one, and others deregulate gene expressions in human cancers (Lopez-Rincon et al. 2019; “Genome”, n.d.). But, most importantly, after discovering that mir15 and mir16 are both downregulated in chronic leukemia, Calin et al. first established the relationship between miRNAs and cancers (Calin et al., 2002). These observations have led to growing interest in experimenting with miRNAs as biomarkers for diagnosing cancer.
However, it has been difficult to identify the most critical miRNAs for cancer detection, as only they play crucial roles in cancer formation. This raises the challenge of only identifying the relevant miRNAs, or, in other words, the minimal number of miRNAs to measure for cancer detection, to reduce the amount of computation and effort. Hence, utilizing machine learning-based methods for both miRNA data processing and feature selection may greatly improve the identification of important miRNA biomarkers and serve as a more convenient way to detect cancers.
Materials and Methods
In this work, we propose a robust deep-learning methodology that achieves two objectives: eliminating the non-significant circulating miRNAs and utilizing models to detect 6 different cancer types. Our process, which is detailed in Figure 1, relies on miRNA datasets as input data, software libraries to build deep-learning classifiers and an environment to train, modify, and evaluate deep-learning architectures. Our methods are effective and comparable to a variety of state-of-the-art techniques (Lopez-Rincon et al., 2020; Rehman et al., 2019; Rosenfeld et al., 2008; Cheerla & Gevaert, 2017).
The GDC Cancer Portal from the National Cancer Institute (“GDC”, n.d.) is a platform that offers numerous types of data for a variety of cancers for bioinformaticians and researchers alike, e.g., lincRNA, miRNA, protein-coding, etc. From this Portal, we sourced miRNA .txt text files that contain miRNA data for tumor and control samples for 6 different cancer types, breast, head-and-neck, lung, liver, kidney, and thyroid, as they are among the most common cancers.
Each cancer file contains row headers of different miRNA names (e.g., hsa-miR-1253) and column headers of Tumor_# and Control_# (normal), where Tumor and Control cells with # represent each patient number ranging from 39 to 86 patients per file. In the body of each file include zeros and positive numbers representing the gene expression of each miRNA in each Tumor or Control cell in numerical value. We then loaded each cancer data file into the RStudio integrated development environment (IDE) that is running on a Windows 10 laptop, Version 10.0.19044 to use the R programming language and its data processing and machine learning libraries to train deep learning models with miRNA data.
However, we only want to use statistically significant miRNAs (p-value < 0.05) for input into our models, so we performed feature selection and t-tests (which involve comparing mean values of each miRNA biomarker) for each .txt file to only retrieve the critical miRNA biomarkers that are the most effective at discriminating tumor and normal tissues (Hayes, 2022). Lastly, to train each architecture, we used 70/30, 75/25, and 80/20 train-test splits on significant miRNAs, meaning that 70%, 75%, and/or 80% of the numerical data is used to train the model, and the trained model is tested on the remaining 30%, 25%, and/or 20%.
To test the capabilities of deep learning, we developed two well-known architectures, Random Forest and XGBoost, to perform binary classification to distinguish gene expressions of tumor versus control cells of patients for each of the 6 cancers.
A classifier that combines multiple produced decision trees, Random Forest (RF) classifies data by averaging the class predictions produced by all the trees. (In our research, the two classes needed to be predicted are tumor and normal.) At each node of each tree, RF performs a random selection of features from input data, and a binary split is performed on that node. This process is repeated until a minimum node size is reached [10].
XGBoost, or Extreme Gradient Boosting, is a type of decision tree that is also composed of multiple trees, similar to the RF architecture. However, XGBoost utilizes a methodology called gradient boosting, which is where the process of combining weak models with other weak trees to generate a stronger model is set as a gradient descent algorithm to reduce errors. In all, XGBoost iteratively trains a collection of low-performing decision trees to improve the next model's accuracy until the final classification output is calculated as a weighted sum of all the tree predictions (“NVIDIA Data Science Glossary,” n.d.).
Our proposed XGBoost and Random Forest frameworks are built using the R programming language’s built-in XGBoost and RF libraries and trained on miRNA data in RStudio. Our process involves no overhead purchases, making our method both convenient and accessible to other researchers and software developers. The code of our process can be found in our GitHub repo at https://github.com/TheClassicTechno/miRNACancerClassification.
To train these architectures and compare their performances against different versions of themselves and each other, we modified the following parameters: the number of decision trees (ntrees) in RF and the number of iterations for boosting (nrounds) in XGBoost (“XGBoost Parameters,” n.d.). To especially evaluate their robustness and generality, we trained both models on all six different cancer files listed above.
Figure 1. Our proposed step-by-step process for using miRNA biomarkers and machine learning methods to detect cancers.
Results
To display our models’ accuracies listed in Figure 2, Figure 3, and Figure 4 below in an organized way, we utilized Google Sheets charts, using two Random Forest and two XGBoost models as the column headers and the names of the cancer files used as the row headers. As seen in Figure 5, the XGBoost model on average produced higher performance than Random Forest. This could be because while RF uses bagging to build full decision trees from random data samples and averages all decision tree predictions no matter if they are inaccurate, XGBoost carefully improves singular weak trees by combining them with other weak ones to improve their performances so they don't "drag down" the overall XGBoost performance. In addition, XGBoost fits the next tree by using the error residuals of the previous one to improve the trees' learning (“NVIDIA Data Science Glossary,” n.d.). The train-test-split values do not seem to have as noticeable of a difference in performance, as they differ in the hundredth place on accuracy. Nevertheless, the 80% 20% split resulted in the highest accuracy, which could be because this split provided the models with the highest number of training samples to work with and therefore helped improve the architectures’ learning even further. When looking at case-by-case values, it seems like the head-and-neck and lung cancer files received the lowest accuracies on average, most likely because they have the least number of patients (43 and 39) and therefore fewer data examples for the models to learn from. On the other hand, the breast cancer files resulted in the highest accuracies because they have the greatest number of patients (50 and 86).
Figure 2. Full chart of binary classification accuracies, in decimal format, of each machine learning model version (Random Forest and XGBoost) for 6 cancers using a train-test-split of 80% 20%, meaning 80% of data is for training the model and 20% is for testing. The x-axis displays the cancer types while the y-axis displays the accuracies of the models on a scale from 0 to 1 (1 being 100% accurate). Most architectures reached 90% and over.
Figure 3. Full chart of binary classification accuracies, in decimal format, of each model version for 6 cancers using a train-test-split of 70% 30%.%. Similarly, the x-axis displays the cancer types while the y-axis displays the accuracies of the models on a scale from 0 to 1 (1 being 100% accurate). Most architectures reached 90% and over, but the overall performances are slightly lower than the previous graph with a 0.8 0.2 tumor-normal split.
Figure 4. Full chart of binary classification accuracies, in decimal format, of each model version for 6 cancers using a train-test-split of 75% 25%. Most architectures reached 90% and over, but the overall performances are slightly higher than the previous graph with a 0.7 0.3 tumor-normal split.
Figure 5. Summary chart of averaged accuracies of each train-test-split value and model. On average, a more extreme train-test split that heavily favors more data for training received higher model performances.
Discussion
Early detection is critical for cancer patients to receive proper treatment to improve health outcomes. In this work, we utilized differentiating gene expressions from biomarkers in tumor and control cells to train two versatile machine learning frameworks to classify each of the six cancer types, five of them which are the most common kinds. Despite the dissimilar cancer types and different gene expressions of miRNA biomarkers for each cancer, both our proposed XGBoost and Random Forest architectures distinguished between control and tumor type expressions robustly, resulting in accuracies over 95% for all six cancers.
Our architectures were easily built, quickly trained, and tested (the process takes less than 30 seconds for each model), and evaluated on minimal and low-cost resources, making our process accessible and easy to set up for anyone with a laptop and access to the internet while also achieving consistent state-of-the-art (SOTA) results in machine learning research for different model versions and six different cancers’ miRNA expressions. We suspect further optimization and retraining of our two models can help improve binary classification accuracies even more. Future direction includes deep learning model deployment into an IoT device for testing in real time. Our research proves that utilizing machine learning to aid in cancer detection can be both easy to set up and robust. Linking biomarker expression to cancer prediction through deep-learning methodologies to distinguish between the gene expressions of cancer and non-cancerous tissues offers a more convenient and quick solution for potential cancer patients. As early detection is critical for cancer patients to receive proper treatment, our robust, consistently accurate results prove that machine learning can be a practical solution for easier and faster cancer diagnosis for patients worldwide, potentially saving millions.
References
Comentarios