Project goal
Artificial intelligence (AI) has the potential to revolutionize disease diagnosis and management by performing classifications that are difficult for human experts and by rapidly reviewing immense numbers of images. In this project, based on the article by Kermany et al. (2018), our objective is to classify individuals with pneumonia. After this first phase, our idea is to identify the cause of the pneumonia, whether it is viral or bacterial.
Dataset description
The dataset is organized into two folders (train, test), each containing subfolders for the two image categories (pneumonia/normal). There are 5,863 X-ray images (JPEG) in total. Chest X-ray images (anterior-posterior) were selected from retrospective cohorts of pediatric patients aged one to five years from Guangzhou Women and Children’s Medical Center, Guangzhou. All chest X-ray imaging was performed as part of the patients’ routine clinical care. For the analysis of chest X-ray images, all chest radiographs were initially screened for quality control by removing all low-quality or unreadable scans. The diagnoses for the images were then graded by two expert physicians before being cleared for training the AI system. In order to account for any grading errors, the evaluation set was also checked by a third expert.
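As a quick sanity check, the folder layout and image count described above can be verified programmatically. The sketch below assumes the dataset was extracted to a local `chest_xray/` directory with `NORMAL`/`PNEUMONIA` subfolders; the base path and exact folder names are assumptions and may need adjusting.

```python
import os

# Hypothetical location of the extracted dataset -- adjust as needed.
BASE_DIR = "chest_xray"

total = 0
for split in ("train", "test"):
    for label in ("NORMAL", "PNEUMONIA"):
        folder = os.path.join(BASE_DIR, split, label)
        n = len([f for f in os.listdir(folder) if f.lower().endswith(".jpeg")])
        total += n
        print(f"{split}/{label}: {n} images")
print(f"total: {total} images")
```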
Reference: Kermany, Daniel S., et al. "Identifying medical diagnoses and treatable diseases by image-based deep learning." Cell 172.5 (2018): 1122-1131.
Importing files
Labeling types of pneumonia
Viewing database samples
Exploratory Data Analysis
As the EDA shows, the dataset is imbalanced between people who have pneumonia and those who do not: there are roughly three times more pneumonia samples than normal samples. It is also worth highlighting that this imbalance persists, in roughly the same proportion, within the pneumonia samples when split by cause (viral vs. bacterial).
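A minimal sketch of how this imbalance can be quantified, assuming the same hypothetical `chest_xray/` layout as above and assuming that the pneumonia file names encode the cause (i.e. contain "bacteria" or "virus"), which is how this dataset is commonly distributed:

```python
import os

BASE_DIR = "chest_xray"  # hypothetical path, as above

normal_dir = os.path.join(BASE_DIR, "train", "NORMAL")
pneu_dir = os.path.join(BASE_DIR, "train", "PNEUMONIA")

n_normal = len(os.listdir(normal_dir))
pneu_files = os.listdir(pneu_dir)
n_bacteria = sum("bacteria" in f.lower() for f in pneu_files)
n_virus = sum("virus" in f.lower() for f in pneu_files)

print(f"normal: {n_normal}, pneumonia: {len(pneu_files)} "
      f"(ratio {len(pneu_files) / n_normal:.1f}x)")
print(f"  bacterial: {n_bacteria}, viral: {n_virus}")
```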
Data Augmentation
Data augmentation is a strategy used to improve the diversity and quality of the data, especially in the computer-vision domain. In order to address the class imbalance and avoid overfitting, we need to artificially expand our dataset, making the existing data even larger. The idea is to alter the training data with small transformations that reproduce natural variations. Approaches that alter the training data in ways that change the array representation while keeping the label the same are known as data augmentation techniques. Some popular augmentations are grayscale conversion, horizontal flips, vertical flips, random crops, color jitter, translations, rotations, and many more. By applying just a couple of these transformations to our training data, we can easily double or triple the number of training examples and build a much more robust model.
In this function we need to: 1. Randomly pick images and generate augmented versions of them; 2. Save the generated images in a different folder so they are not mixed with the raw images. A sketch of such a function is shown below.
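A minimal sketch of this function, using Keras' `ImageDataGenerator` with mild, label-preserving transformations. The directory names, sample size, and number of new images per original are illustrative assumptions, not values from the article.

```python
import os
import random
from tensorflow.keras.preprocessing.image import (
    ImageDataGenerator, load_img, img_to_array)

def augment_folder(raw_dir, aug_dir, n_images=200, n_new_per_image=3):
    """Randomly pick images from raw_dir and write augmented copies to aug_dir."""
    os.makedirs(aug_dir, exist_ok=True)

    # Mild, label-preserving transformations (no vertical flips for chest X-rays).
    datagen = ImageDataGenerator(
        rotation_range=15,
        width_shift_range=0.1,
        height_shift_range=0.1,
        zoom_range=0.1,
        horizontal_flip=True,
    )

    files = os.listdir(raw_dir)
    for fname in random.sample(files, k=min(n_images, len(files))):
        arr = img_to_array(load_img(os.path.join(raw_dir, fname)))
        batch = arr.reshape((1,) + arr.shape)  # flow() expects a 4D batch
        flow = datagen.flow(batch, batch_size=1, save_to_dir=aug_dir,
                            save_prefix="aug", save_format="jpeg")
        for _ in range(n_new_per_image):
            next(flow)  # each call writes one augmented image to aug_dir

# Example: augment the minority class into a separate folder (paths are assumptions).
augment_folder("chest_xray/train/NORMAL", "chest_xray/train_augmented/NORMAL")
```

Keeping the generated images in a separate folder, as required above, makes it easy to re-run the augmentation with different parameters without contaminating the raw dataset.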