Python Project 5 - K-Means Clustering
In this project we explore a dataset containing information on breakfast cereals. We consider only the manufacturers, cereal ratings, and sugar and carbohydrate levels. Some values in the sugars and carbohydrates features are invalid, and we must address them.
We want to answer the following questions.
- Which cereal manufacturers have the most/least varieties?
- Which cereal has the highest rating?
- Investigate sugar levels in cereals
- Does each manufacturer provide cereals equally among the sugar levels?
- How are the cereals clustered by sugar levels (low, medium, high)?
- Which cereals have the highest sugar levels? The lowest sugar levels?
- How does your favorite breakfast cereal rank?
- Investigate carbohydrate levels
- Cluster cereals in low, medium and high level of carbs
- What cereals are low sugar and low carbs?
- What cereals are high sugar and high carbs?
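As a rough sketch of how the first few questions might be approached with pandas: the rows below are made up for illustration, and the column names (`name`, `mfr`, `rating`, `sugar_level`) are assumptions rather than something the dataset guarantees.

```python
import pandas as pd

# Made-up rows for illustration only; the real data comes from the cereal file.
sample = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C", "Cereal D", "Cereal E"],
    "mfr": ["K", "K", "K", "G", "G"],
    "rating": [40.0, 72.5, 33.1, 51.0, 29.9],
    "sugar_level": ["low", "high", "high", "low", "mid"],  # assumed label column
})

# Number of cereal varieties per manufacturer.
varieties = sample["mfr"].value_counts()
most = varieties.idxmax()
least = varieties.idxmin()

# Name of the cereal with the highest rating.
top_cereal = sample.loc[sample["rating"].idxmax(), "name"]

# How each manufacturer's cereals are spread across the sugar levels.
spread = pd.crosstab(sample["mfr"], sample["sugar_level"])

print(most, least, top_cereal)
print(spread)
```

A cross-tabulation is only one way to compare manufacturers across sugar levels; a grouped bar plot of the same counts would serve equally well.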
Steps:
- Import libraries and the K-Means model from scikit-learn
- Create dataframe and print out a few lines
- Drop all features in the dataframe except those we are using: Name, Manufacturer, Sugars, Carbohydrates, and Rating; alternatively, you can create a new dataframe with just the features you want
- Check to see if there are any NaN values in sugars or carbohydrates features
- Check to see if there are any invalid values (that is, values < 0) in the sugars or carbohydrates features; if two or fewer cereals have these invalid values, drop the cereal(s) from the dataframe; otherwise replace the negative values with 0.
- Plot the number of cereals for each manufacturer; if any manufacturer has only 1 cereal, drop that data instance. Which manufacturer has the most varieties of cereals?
- In dataframe the manufacturer is listed by a single letter; for plotting purposes add a column giving the actual name and delete feature 'mfr'
- Use a plot to determine which manufacturer has cereals with the highest ratings.
- What cereal has the highest rating?
- Plot sugar levels vs manufacturer and determine which brand has the lowest sugar levels.
- Cluster data using K-Means and the sugars feature with clusters low, middle and high sugar levels; print out cluster centroids; add cluster as feature in dataframe.
- Determine which cluster is associated with low, mid or high sugar levels; add feature giving sugar level corresponding to cluster
- Create a plot to show how cereals are distributed among sugar levels; what cereals have highest sugar levels? lowest sugar levels?
- Repeat the steps you did for investigating sugar levels using carbohydrate levels.
- Print out the cereals that are low sugars and low carbohydrates.
- Print out the cereals that are low sugars and high carbohydrates.
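The data-cleaning steps above could be sketched roughly as follows. The small dataframe is made up and stands in for the result of reading the cereal file; the lowercase column names (`name`, `mfr`, `sugars`, `carbo`, `rating`) are assumptions, and the "two or fewer" rule is read here as counting rows with a negative value in either feature.

```python
import pandas as pd

# Made-up rows standing in for the real file; column names are assumed.
df = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C", "Cereal D",
             "Cereal E", "Cereal F", "Cereal G"],
    "mfr": ["K", "K", "G", "G", "A", "K", "G"],
    "type": ["C", "C", "C", "C", "H", "C", "C"],
    "sugars": [6, -1, 12, 3, 0, 9, 10],
    "carbo": [14.0, 21.0, -1.0, 17.0, 16.0, 13.0, 12.0],
    "rating": [40.1, 55.3, 28.7, 60.2, 54.9, 35.0, 31.4],
})

# Keep only the features used in the project.
df = df[["name", "mfr", "sugars", "carbo", "rating"]].copy()

# Count NaN values in the two features of interest.
nan_counts = df[["sugars", "carbo"]].isna().sum()

# Flag rows with a negative value in either feature.
invalid = (df[["sugars", "carbo"]] < 0).any(axis=1)
if invalid.sum() <= 2:
    # Two or fewer affected cereals: drop them.
    df = df[~invalid]
else:
    # Otherwise replace negative values with 0.
    df[["sugars", "carbo"]] = df[["sugars", "carbo"]].clip(lower=0)

# Drop cereals whose manufacturer has only one cereal left.
counts = df["mfr"].value_counts()
df = df[df["mfr"].map(counts) > 1].reset_index(drop=True)

print(nan_counts)
print(df)
```

With the real data, the first dataframe would instead come from reading the CSV file, and the per-manufacturer counts could be passed to a bar plot before any rows are dropped.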
Data Set Description
This dataset is contained in the file `cereal.csv`.
This dataset contains information on 77 different breakfast cereals. The features are:
- Cereal Name
- Manufacturer
  - A -> American Home Food Products
  - G -> General Mills
  - K -> Kelloggs
  - N -> Nabisco
  - P -> Post
  - Q -> Quaker Oats
  - R -> Ralston Purina
- Type (C -> Cold, H -> Hot )
- Calories (per serving)
- Protein (in grams)
- Fat (in grams)
- Sodium (in milligrams)
- Fiber (in grams of dietary fiber)
- Carbo (grams of complex carbohydrates)
- Sugars (in grams)
- Potass (milligrams of potassium)
- Vitamins (possible values are 0, 25 or 100 indicating percent of FDA recommended vitamins and minerals)
- Shelf (display shelf with possible values 1, 2, or 3 counting from floor)
- Weight (weight of one serving in ounces)
- Cups (number of cups in one serving)
- Rating (Consumer ratings of cereal from 0 to 100)
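The manufacturer codes listed above can be turned into full names with a dictionary lookup. The following is a minimal sketch on made-up rows; the new column name `manufacturer` and the sample values are illustrative assumptions, while the code-to-name mapping is taken from the list above.

```python
import pandas as pd

# Mapping copied from the feature description.
MFR_NAMES = {
    "A": "American Home Food Products",
    "G": "General Mills",
    "K": "Kelloggs",
    "N": "Nabisco",
    "P": "Post",
    "Q": "Quaker Oats",
    "R": "Ralston Purina",
}

# Made-up rows for illustration only.
cereals = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C", "Cereal D"],
    "mfr": ["K", "G", "Q", "K"],
    "rating": [40.0, 30.0, 55.0, 60.0],
})

# Add the readable name, then delete the single-letter feature.
cereals["manufacturer"] = cereals["mfr"].map(MFR_NAMES)
cereals = cereals.drop(columns="mfr")

# Mean rating per manufacturer, highest first (one possible basis for a plot).
mean_rating = (
    cereals.groupby("manufacturer")["rating"].mean().sort_values(ascending=False)
)
print(mean_rating)
```

Any code missing from the dictionary would map to NaN, so checking the new column for missing values after the lookup is a cheap sanity test.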
Cluster by sugars into highest, middle and lowest levels; random initial guess
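One way this clustering step might look with scikit-learn's `KMeans` is sketched below. The sugar values are made up, the column names `sugar_cluster` and `sugar_level` are illustrative, and `n_init=10` and `random_state=0` are arbitrary choices made only so the example is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up sugar values (grams) for illustration only.
df = pd.DataFrame({
    "name": [f"Cereal {c}" for c in "ABCDEFGHI"],
    "sugars": [0, 1, 2, 6, 7, 8, 13, 14, 15],
})

# Three clusters with a random initial guess for the centroids.
km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
df["sugar_cluster"] = km.fit_predict(df[["sugars"]])

# Cluster centroids (one value each, since there is a single feature).
centroids = km.cluster_centers_.ravel()
print(centroids)

# Cluster numbers are arbitrary, so rank the clusters by centroid value
# to decide which one is low, mid or high.
order = np.argsort(centroids)
level_of = {int(cluster): level for cluster, level in zip(order, ["low", "mid", "high"])}
df["sugar_level"] = df["sugar_cluster"].map(level_of)
print(df)
```

Ranking by centroid avoids hard-coding which cluster number means "low", which can change from run to run when the initial guess is random.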
Repeat calculations and plots for carbohydrates instead of sugars
What cereals are high carbs and low sugar? What are low carbs and low sugar?
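Once both level features exist, these questions reduce to boolean filters. A minimal sketch, assuming hypothetical label columns named `sugar_level` and `carb_level` and using made-up rows:

```python
import pandas as pd

# Made-up rows; in the project these labels would come from the two clusterings.
levels = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C", "Cereal D"],
    "sugar_level": ["low", "low", "high", "mid"],
    "carb_level": ["low", "high", "high", "low"],
})

low_sugar = levels["sugar_level"] == "low"

# Low sugar and low carbs.
low_low = levels.loc[low_sugar & (levels["carb_level"] == "low"), "name"].tolist()

# Low sugar and high carbs.
low_high = levels.loc[low_sugar & (levels["carb_level"] == "high"), "name"].tolist()

print("low sugar, low carbs:", low_low)
print("low sugar, high carbs:", low_high)
```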