On Predicting Cancer using Machine Learning
One of the biggest health concerns of our generation is cancer. The chances of diagnosing and treating cancer successfully improve when it is detected in its early stages. With the rise of machine learning and artificial intelligence, a number of studies have looked for patterns in patient data that could predict the likelihood of cancer. In this article we will look at one such data set and apply a simple machine learning algorithm to see if we can make a prediction.
We are going to use Haberman's Survival Data Set, which contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer. The lymph nodes in the underarm, called axillary nodes, are the first place breast cancer is likely to spread, which makes them an important feature. The data set includes two other features that were used in the study.
The data set has the following attributes: age of the patient, year of the operation, and number of positive axillary nodes, along with the survival status we want to predict. Here are the first 10 data points from the set.
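To make the schema concrete, here is a minimal sketch that builds a few rows in the same format. The column names and the values below are illustrative assumptions, not actual records from the study; in the real data set, survival status 1 means the patient survived 5 years or longer and 2 means the patient died within 5 years.

```python
import pandas as pd

# Hypothetical rows in the data set's format (age, year of operation,
# positive axillary nodes, survival status) -- illustrative only.
rows = [
    (34, 60, 1, 1),
    (42, 63, 0, 1),
    (57, 65, 11, 2),
]
df = pd.DataFrame(
    rows,
    columns=["age", "operation_year", "axillary_nodes", "survival_status"],
)
print(df)
```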
Our goal is to use the first 3 attributes to predict the survival status using a training set. This is clearly a classification problem. Popular algorithms for modeling a classification problem include logistic regression, decision trees, and support vector machines (SVMs). We have about 300 data points in this real-world data set, each with 3 attributes. Since the data set is small, performance is not an issue, and an initial inspection of the data suggests the relationship between the features is non-linear, I have chosen decision trees for this example.
Here is the Python code that shows the use of a simple decision tree classifier. The code is deliberately simple; it only demonstrates applying a classifier algorithm to a known data set. We will see other ML algorithms in upcoming articles.
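The workflow can be sketched as follows, using scikit-learn's `DecisionTreeClassifier`. The synthetic stand-in data and the toy labeling rule are assumptions made so the example is self-contained; with the real data set you would load the three feature columns and the survival-status column instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data in the Haberman schema: age, year of
# operation (19xx), and number of positive axillary nodes.
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.integers(30, 84, n),   # age of patient at operation
    rng.integers(58, 70, n),   # year of operation
    rng.integers(0, 30, n),    # positive axillary nodes
])
# Toy labeling rule (an assumption for this sketch): few positive
# nodes -> survived 5+ years (1), otherwise died within 5 years (2).
y = np.where(X[:, 2] < 5, 1, 2)

# Hold out some points so we can test the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print("accuracy on held-out data:", clf.score(X_test, y_test))
```

The `max_depth` limit keeps the tree small enough to visualize and guards against overfitting a data set of this size.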
The output of running this program on a couple of test inputs is shown below. These test inputs were held out from the training set that was used to fit the model.
The outputs are as expected for the known data set.
Conclusion: Decision trees are simple to use and easy to visualize and understand on practical data sets. As the inputs get more complex (more features and higher data volume), we can use SVMs for classification, which will be shown in a future article.