AIxFP: CNN Pneumonia Detector
Pneumonia is an infection that inflames your lungs' air sacs (alveoli). The air sacs may fill up with fluid or pus, causing symptoms such as a cough, fever, chills and trouble breathing.
Pneumonia symptoms can vary from so mild you barely notice them to so severe that hospitalization is required. How your body responds to pneumonia depends on the type of germ causing the infection, your age and your overall health.
In developed countries, pneumonia is the sixth leading cause of death, with an incidence of between 7 and 15 cases per 1,000 people per year.
To diagnose pneumonia, the most common technique after visiting a doctor is to carry out a chest X-ray, which is a two-dimensional image of the lungs produced with X-rays. As the X-ray beam crosses the body, it is absorbed in different amounts depending on the density of the material it passes through: dense materials such as bone and metal appear white on the radiograph, while air appears black. Since healthy lungs are full of air, a healthy radiograph is mostly black; but because liquid is denser than air, when the alveoli fill with fluid (a consequence of pneumonia) certain areas of the lungs appear white. Sometimes this is very clear to see, but other times... it is not:
Typically, the practice and experience of medical imaging professionals allow them to see in a radiograph what most of us cannot spot with the naked eye, and to diagnose with confidence whether a radiograph shows signs of pneumonia or not. On the other hand, the workflow of these highly qualified healthcare professionals is so tight and busy that sometimes even they miss a diagnosis. Not only that: the correct analysis of a single radiograph consumes time and resources, since several specialists often have to check it.
That is one reason why Artificial Intelligence algorithms are being applied to read radiographs: not to replace health professionals, but to:
- reduce the volume of work through screening programs that speed up diagnosis
- highlight new features that may not be recognizable to the naked eye
- support diagnosis
And the good news is that... this doesn't need to be a complicated algorithm!! Of course there are companies whose job is to develop this kind of algorithm, and they reach a level of sophistication that we cannot match in this exercise. However, we are going to build a Convolutional Neural Network (CNN) that will help us distinguish whether a chest X-ray shows signs of pneumonia or not.
Before starting, I just wanted to let you know that this notebook is based on another notebook available on Kaggle (this one), which we can all access freely. I encourage you to do the same: check Kaggle, download a notebook that does something interesting to you and investigate how it works.
Kaggle is a platform where you can find all types of datasets (images, tabular data, etc.) as well as ready-made notebooks. In addition, it hosts many interesting competitions. It is a great resource to start learning data science and artificial intelligence!!
First, we start importing libraries, our toolkit 🛠
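Something like this (a hedged sketch: the exact list depends on what the rest of the notebook uses, but a typical toolkit for this task is TensorFlow/Keras, NumPy and Matplotlib):

```python
import os                                   # paths and directory handling
import numpy as np                          # numeric arrays
import matplotlib.pyplot as plt             # plotting images and training curves

import tensorflow as tf                     # deep learning framework
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator
```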
# 1. How is our data?... Exploratory Data Analysis (EDA)
Always, always, always and ALWAYS, the first thing we have to do with our toolkit is get to know the data we have: learn its main characteristics and what is particular about it. You will find the acronym EDA many times when talking about data science or AI, because it is a very important step, often the most important one. EDA refers to the exploration of our data and may imply transforming it in order to make sense of it, while also checking that there are no outliers or mistakes that could spoil our model (look at the meme, for instance!).
In most cases, if we download a dataset from a platform like Kaggle, the data is labelled, meaning that if we download a dataset of dogs and cats, every image is likely to have a label saying whether it is a cat or a dog. When we work with this kind of data we call it *Supervised Learning*; if the data does not have labels, then we are most likely working with unsupervised learning, but we will talk about that another day.
In our case, we are very lucky that the original dataset already has labels on the images:
- "Normal": radiography without pneumonia
- "Pneumonia": radiography with pneumonia
1.1 Train, test, validation sets
Before training a model, it is very important to split the data. Why? Easy! Because if we train an algorithm with certain images, we do not want those same images to reappear when we test the accuracy of our model; if that happens, we will be cheating ourselves. That is why we split the data into 3 sets:
- Train (training): contains the data / images which will train our model.
- Validation: contains the data / images that we will use to validate our model. The goal of this set is to avoid overfitting, that is, to prevent the model from learning the "Train" images by heart instead of learning to infer the right classification for new images it has not seen (read) before.
- Test: contains the data / images that we will use to test the model once it has been trained (NEVER BEFORE!), because we do not want the model to see the test images ahead of time and memorize them.
How do we divide the data? Typically, we first split the dataset into Train and Test, not touching the Test set until the end, and then split the Train set into 80% for Train and 20% for Validation (these percentages are not mandatory, but they tend to be a good rule of thumb). Then we train the model with that 80% Train split and validate it with the Validation set; repeating this train/validate cycle over different splits of the data is what we call Cross Validation.
To actually do it, we can use many different tools; the most usual is to rely on already existing libraries (tools), for example:
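A hedged sketch with scikit-learn's `train_test_split` (the arrays are placeholders, just to show the mechanics of the two-step split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data just to show the mechanics: 100 fake 64x64 "images" and labels.
images = np.random.rand(100, 64, 64)
labels = np.random.randint(0, 2, size=100)

# First split off 20% as the untouched Test set...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    images, labels, test_size=0.2, random_state=42)

# ...then split the remaining data into Train (80%) and Validation (20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42)
```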
Nevertheless, in this particular problem, the Kaggle dataset is already distributed into Train, Test and Validation sets, so that is another thing we do not have to worry about this time.
1.2 Load the Data
1.2.1 In this exercise, we have the data ready to download and unzip from Google Drive to Colab.
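Something along these lines (a hedged sketch: the Drive file ID and the zip file name are placeholders, not the real ones):

```python
import zipfile
import gdown  # helper library for downloading files shared on Google Drive

# Hypothetical file ID of the zipped dataset shared on Google Drive.
file_id = "YOUR_DRIVE_FILE_ID"
gdown.download(f"https://drive.google.com/uc?id={file_id}",
               "chest_xray.zip", quiet=False)

# Unzip into the Colab working directory.
with zipfile.ZipFile("chest_xray.zip") as z:
    z.extractall(".")
```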
1.2.2 Another way to obtain the data is to download the dataset directly from the Kaggle repository and then upload it here. --> if you executed the previous cell, do not execute this one
1.2.3 Now that the datasets are downloaded and available in the notebook, we assign the data to variables (a kind of multipurpose drawer) to manage them.
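A hedged sketch, assuming the standard folder layout of the Kaggle chest X-ray dataset (train/val/test folders, each with NORMAL and PNEUMONIA subfolders); the folder names and image size are assumptions:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Paths to the three splits (adjust if your unzipped folder is named differently).
train_dir = "chest_xray/train"
val_dir   = "chest_xray/val"
test_dir  = "chest_xray/test"

# Generators read images from disk in batches, rescale pixels to [0, 1]
# and label each image from its subfolder name (NORMAL / PNEUMONIA).
datagen = ImageDataGenerator(rescale=1.0 / 255)

train_generator = datagen.flow_from_directory(
    train_dir, target_size=(150, 150), color_mode="grayscale", class_mode="binary")
val_generator = datagen.flow_from_directory(
    val_dir, target_size=(150, 150), color_mode="grayscale", class_mode="binary")
test_generator = datagen.flow_from_directory(
    test_dir, target_size=(150, 150), color_mode="grayscale", class_mode="binary",
    shuffle=False)  # keep order so predictions line up with test_generator.classes
```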
1.3. Let's see some of the images that we have loaded.
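For instance (a hedged sketch using the training generator defined above):

```python
import matplotlib.pyplot as plt

# Take one batch from the training generator and show the first few images.
images, labels = next(train_generator)
class_names = {v: k for k, v in train_generator.class_indices.items()}

fig, axes = plt.subplots(1, 4, figsize=(12, 4))
for ax, img, label in zip(axes, images, labels):
    ax.imshow(img.squeeze(), cmap="gray")   # drop the grayscale channel for plotting
    ax.set_title(class_names[int(label)])
    ax.axis("off")
plt.show()
```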
The images above are drawn randomly from the dataset. As you can see, there are differences between them, so let's create an artificial intelligence (a CNN) that can determine what those differences are and classify the images into 2 categories: normal and pneumonia.
But... before continuing, what is a CNN and how does it classify images?
To answer this question, we must answer another one first: how can a computer see?
The way we, mammals, and almost any other living being with eyes and a developed brain see is something like this: 1. Light beams reflected by an object hit your retina. 2. This perception is carried to your brain. 3. Then the brain, after interpreting the input, decides what it sees.
Therefore, ever since you were a child, the world has been teaching you what things are, teaching you to label everything you see. You know an elephant is an elephant because you look at it and someone tells you that object is an animal called elephant; your brain then records that association and you do not need to learn it again.
Computers, however, cannot learn as we do. Computers need to "look" at and analyse thousands and thousands of images before being able to generalize and say that an elephant belongs to the category of mammals, like a lion. This is because what a computer sees is not an image, but the numeric representation of the pixels that describe the image. So, while we see "things" in images, a computer sees this...
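To see it for ourselves (a hedged sketch, reusing a batch from the generator defined earlier):

```python
# One batch of images: each image is just an array of numbers between 0 and 1.
images, labels = next(train_generator)
print(images.shape)                 # e.g. (32, 150, 150, 1): batch, height, width, channel
print(images[0, 70:75, 70:75, 0])   # a small 5x5 patch of raw pixel values
```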
What is more, the way our brain visualizes and processes images is very efficient. Neurons dedicated to vision do not all fire at the same time; only certain specialized neurons react to a particular receptive field: some neurons may identify vertical lines, others horizontal ones, others colours... and this is precisely what Convolutional Neural Networks are inspired by.
Convolutional Neural Networks to the rescue...
A convolutional neural network is a special type of neural network (an algorithm) made up of multiple convolutional layers, which use the mathematical operation called convolution to process image data.
Similarly to what happens with the neurons of our visual cortex, the neurons in a convolutional layer are not connected to all the neurons of the next layer (which would make a fully connected layer); instead, only a small region of one layer is connected to the neurons of a small region of the next layer. This makes CNNs fast, efficient and a very good choice for image processing tasks.
If you want to check how a CNN works, I advise you to visit this site, where you will find a lot of information. For more depth, but still simple, you can check this other one. Here is a summary of how a CNN works:
CNNs basically have 2 big steps: feature extraction and classification.
1) Feature extraction: in this stage, the input of the network is an image and the output is its features, the important aspects of it; for instance, if the image is a cat, the ears, legs and whiskers are probably important features.
- The network takes an image as input and convolves it with a filter or kernel, which is just a matrix of numbers. The filter (green) scans the image (blue), performing convolution operations on every set of pixels and producing what is called a feature map.
- The feature maps produced by the different filters form the next convolutional layer of the network, with a smaller dimension than the previous one; that downsampling is what is called pooling. The most popular pooling function is max pooling, and it is what we will use later, since we lose almost no information and gain a lot of processing time (a small numeric sketch follows after this list).
2) Classification: another advantage of CNNs is that it is not necessary to create a network from scratch every time we want to build a classification model. Many times we can take a pretrained network, i.e. a network with many layers already trained on the feature extraction task (identifying stripes, colours, curved lines, etc.), and just adapt the last layers of the network to our particular case; this is what is known as fine tuning (see the sketch below).
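To make step 1 more concrete, here is a minimal NumPy sketch of a convolution followed by max pooling; the tiny "image" and filter values are invented purely for illustration:

```python
import numpy as np

# Tiny 5x5 "image" and a 2x2 filter, both invented just for illustration.
image = np.array([[1, 2, 0, 1, 3],
                  [0, 1, 3, 1, 0],
                  [2, 1, 0, 0, 1],
                  [1, 0, 1, 2, 2],
                  [0, 3, 1, 0, 1]], dtype=float)
kernel = np.array([[ 1, -1],
                   [-1,  1]], dtype=float)

# "Valid" convolution (strictly a cross-correlation, as deep learning libraries
# implement it): slide the filter over the image and sum the element-wise products.
h = image.shape[0] - kernel.shape[0] + 1   # 4
w = image.shape[1] - kernel.shape[1] + 1   # 4
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        feature_map[i, j] = np.sum(image[i:i + 2, j:j + 2] * kernel)

# 2x2 max pooling with stride 2: keep only the largest value of each 2x2 block,
# shrinking the feature map from 4x4 to 2x2 while keeping the strongest responses.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(feature_map)
print(pooled)
```

And for step 2, a hedged fine-tuning sketch in Keras (this is not what we build below, where we train a small CNN from scratch; the base network and layer sizes are just an example):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

# Reuse a network pretrained on ImageNet as the feature extractor...
base = MobileNetV2(include_top=False, weights="imagenet",
                   input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                      # ...freeze its layers...

model = models.Sequential([
    base,
    layers.Dense(64, activation="relu"),    # ...and only train a new head
    layers.Dense(1, activation="sigmoid"),  # ending in a binary output
])
```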
Everything alright until now? No worries, this is not easy and it is normal to give it several tries before catching it.
Hey! Surprise! There is an easier approach to all this!!
Currently, CNNs are very easy to implement and train, because nowadays we have many libraries (tools) like Keras that make our life easier. Remember, Keras is just a Python library for Deep Learning (AI).
Ready? 3...2...1...Let's code!!
Let's create a CNN with 2 convolutional layers. After each convolution, we will add a pooling layer using the max pooling function that we mentioned before.
After the second convolutional layer, we will add 2 dense (fully connected) layers that will take the feature map of every image and make the predictions. The second dense layer has only 1 unit because the result is binary.
When we compile the model, we will use an optimization function (in this case Adam); we will not explain it in this project, but it is useful to know that such functions exist and are useful.
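A possible implementation of that architecture (a hedged sketch: the filter counts, kernel sizes and image size are assumptions, not necessarily the exact values of the original notebook):

```python
from tensorflow.keras import layers, models

cnn = models.Sequential([
    # 1st convolutional layer + max pooling
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(150, 150, 1)),
    layers.MaxPooling2D((2, 2)),
    # 2nd convolutional layer + max pooling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # flatten the feature maps and classify with 2 dense (fully connected) layers
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # 1 unit: normal (0) vs pneumonia (1)
])
```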
Right now, the variable cnn is our model. Let's see how many layers it has:
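With the sketch above, that is simply:

```python
print(len(cnn.layers))   # number of layers in the cnn model sketched above
```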
Let's train our model with the training set
Let's check our model using the function summary()
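For the model sketched above, that is:

```python
cnn.summary()   # prints each layer, its output shape and its number of parameters
```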
Now, we compile the model
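A hedged sketch of this step, assuming binary cross-entropy (the usual loss for a binary classifier) and the image generators defined earlier; once compiled, we can launch the training announced above and keep the returned `history` for the plots below:

```python
# Compile: Adam optimizer, binary cross-entropy loss, accuracy as the metric to track.
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train on the training generator and evaluate on the validation generator each epoch.
history = cnn.fit(train_generator, epochs=10, validation_data=val_generator)
```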
We did it!! We have created a CNN (an artificial neural network) that is able to classify pneumonia in chest X-rays!!!
But.. watch out!... Not all that glitters is gold!
Every time our training data passes through the CNN, we call that an epoch. At the end of each epoch, a loss function is applied to calculate how different our prediction (a classification in this case) is from the ground truth (the real answer). The resulting difference is the training loss. Our model uses this value to update the weights of the filters; that is the acclaimed technique in artificial intelligence known as backpropagation.
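As a concrete example, assuming binary cross-entropy (the loss used in the compile step above), the loss over a batch of $N$ images is

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\Big]$$

where $y_i$ is the true label (0 = normal, 1 = pneumonia) and $\hat{y}_i$ is the probability predicted by the network: the further the prediction is from the truth, the larger the loss.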
At the end of each epoch, we also apply that loss function to the validation dataset and obtain a validation loss that measures how well the model's predictions match the validation data. This validation dataset is used to test the performance of our model (our AI) during training.
If the loss is small, it means the model has classified quite well the images (or at least most of them) that it has read (seen) in that epoch.
The next graph shows us the loss and accuracy over the epochs.
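For example, using the `history` object returned by `fit` above (a hedged sketch; with matplotlib's default colours, the curves come out blue, orange, green and red in that order):

```python
import matplotlib.pyplot as plt

# Plot training and validation loss/accuracy for every epoch.
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["accuracy"], label="train accuracy")
plt.plot(history.history["val_loss"], label="validation loss")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.legend()
plt.show()
```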
Let's stop and check what is happening here...
The blue and orange lines correspond to the training data (loss and accuracy respectively). We see that, epoch after epoch, the loss gets smaller and smaller, as expected since the model is doing backpropagation, and the accuracy keeps increasing.
The green and red lines correspond to the validation data. Again we can see that the loss diminishes and the accuracy (that is, how closely our predictions match the truth) increases.
Ok, great, our loss diminishes and our accuracy increases; that is awesome, right? Nope, now we should notice 2 other things:
- The accuracy in the training data (orange) is far higher than in the validation data (red).
- The loss in the training data (blue) is much lower than in the validation data (green).
What does this mean? Easy... our model is working too well; it is adjusting too much to the training data. What is happening is that our model is **overfitting**; that is the reason the accuracy was so good (90%).
How do we prevent overfitting?
- Augmenting the dataset
- Reducing the number of features
- Regularizing the model: reducing its complexity (fewer layers), adding regularization parameters, trying other optimization functions, etc.
- Early stopping
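Two of the techniques above (data augmentation and early stopping) are easy to sketch in Keras; a hedged example, reusing the `train_dir`, `val_generator` and `cnn` defined earlier:

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Data augmentation: feed the network slightly modified copies of the training
# images (small rotations, shifts and zooms) so it cannot simply memorize them.
augmented_train = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
).flow_from_directory(train_dir, target_size=(150, 150),
                      color_mode="grayscale", class_mode="binary")

# Early stopping: stop training as soon as the validation loss stops improving
# and keep the weights of the best epoch seen so far.
early_stop = EarlyStopping(monitor="val_loss", patience=3,
                           restore_best_weights=True)

history = cnn.fit(augmented_train, epochs=20,
                  validation_data=val_generator, callbacks=[early_stop])
```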
These and many other techniques make it possible for a graph like the one we saw above to end up looking like this:
We can also apply to our model any metrics we want (e.g. ROC curve, AUC, precision-recall curve, confusion matrix, etc.) in order to evaluate and improve it, and thus reach conclusions that are useful for us, humans.
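For instance, a hedged sketch of a confusion matrix and AUC on the untouched test set, using the `test_generator` defined earlier (created with `shuffle=False` so the predictions stay aligned with the true labels):

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

probs = cnn.predict(test_generator).ravel()   # predicted pneumonia probabilities
preds = (probs > 0.5).astype(int)             # threshold at 0.5 for the binary label
y_true = test_generator.classes               # true labels, in the generator's order

print(confusion_matrix(y_true, preds))        # rows: true class, columns: predicted class
print("AUC:", roc_auc_score(y_true, probs))
```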
To sum up
More and more AI applications are reaching the health sector. Sometimes that may make us think that people will become expendable and that algorithms or artificial intelligence machines will perform every job. However, nothing is further from reality. There are already many organizations applying AI-based solutions to improve clinical diagnosis, optimize treatments, monitor biometric signals, etc. But all these organizations are focused on creating tools that support people, health professionals, so they can do a better job, sometimes clinical, other times administrative or technical.
There are many AI-based health companies!! Who knows... maybe the next big startup using AI to transform medicine and healthcare will be the one you start... Why not?