Pandas Group By
This notebook demonstrates the following features of Pandas:
- Read selected columns (usecols)
- Identify missing values (isnull)
- Drop rows with missing values (dropna)
- Split and summarize data by categories (groupby)
- Sort data by column values (sort_values)
- Pandas visualization (bar chart, scatter plot)
- Pandas visualization backend (matplotlib, plotly)
Step 1 - Set Up Environment
1.1 - Install Plotly
Plotly is an interactive visualization library that can serve as the plotting backend for Pandas. Pandas supports multiple visualization backends: matplotlib is the default, and plotly and bokeh are among the alternatives.
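A minimal sketch of how the backend might be selected. It assumes plotly may or may not be installed and falls back to matplotlib if it is not; the variable name `active_backend` is only for illustration.

```python
import pandas as pd

# pandas checks the backend name when the option is set, so a missing
# plotly package raises an error at this point rather than at plot time.
try:
    pd.options.plotting.backend = "plotly"
except (ImportError, ValueError):
    # plotly is not available: keep the default matplotlib backend
    pd.options.plotting.backend = "matplotlib"

active_backend = pd.options.plotting.backend
print(active_backend)
```

Once the option is set, calls such as `df.plot.bar(...)` are routed to the selected library.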
1.2 - Import pandas
Step 2 - Read Data
We read the data directly from a web link. Because the dataset has almost 2000 columns, we read only a selected list of columns of interest.
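The sketch below illustrates `usecols` on a small in-memory CSV so that it runs without network access. The column names and values are hypothetical stand-ins, not the real dataset; `read_csv` also accepts a URL string in place of the file object.

```python
import io
import pandas as pd

# Hypothetical stand-in for the remote file, which has far more columns
csv_text = """name,state,tuition_in_state,median_earnings,extra_1,extra_2
College A,CA,9000,52000,x,1
College B,CA,11000,48000,y,2
College C,NY,12000,55000,z,3
College D,NY,16000,61000,w,4
"""

columns_of_interest = ["name", "state", "tuition_in_state", "median_earnings"]

# usecols tells read_csv to parse only the listed columns
df = pd.read_csv(io.StringIO(csv_text), usecols=columns_of_interest)
print(df.shape)
print(list(df.columns))
```

Skipping unused columns at read time keeps memory use down, which matters when the source has thousands of columns.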
Step 3 - Cleanse Data
The data have many missing values and other quality issues, so they require cleansing.
3.1 - Handle Missing Values
For simplicity, we drop all rows that contain missing values.
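As a rough illustration, `isnull` can be used to count the gaps before `dropna` removes the affected rows. The frame below is made-up sample data with hypothetical column names.

```python
import numpy as np
import pandas as pd

# Made-up sample with a few missing entries
df = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "TX"],
    "tuition_in_state": [9000.0, np.nan, 12000.0, 15000.0, 8000.0],
    "median_earnings": [52000.0, 48000.0, np.nan, 61000.0, np.nan],
})

# Count missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
clean = df.dropna()
print(len(df), "->", len(clean))
```

Dropping rows is the simplest option but discards data; `dropna(subset=[...])` limits the check to specific columns if that is too aggressive.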
3.2 - Cleanse Median Earnings Column
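The text does not say what is wrong with this column. One common situation, assumed here purely for illustration, is a numeric column stored as text with a placeholder string in some rows. The column name and the placeholder value below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "NY", "TX"],
    # Hypothetical raw values: numbers stored as text plus a placeholder
    "median_earnings": ["52000", "PrivacySuppressed", "61000"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
df["median_earnings"] = pd.to_numeric(df["median_earnings"], errors="coerce")

# Remove the rows that could not be converted
df = df.dropna(subset=["median_earnings"])
print(df)
```

After this step the column is numeric, so it can be averaged, sorted, and plotted.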
Step 4 - Aggregate Data
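A minimal sketch of the split-and-summarize step with `groupby`, assuming the goal is one row per state. The choice of the mean as the summary statistic and the column names are assumptions for illustration.

```python
import pandas as pd

# Made-up institution-level sample
df = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "TX"],
    "tuition_in_state": [9000.0, 11000.0, 12000.0, 16000.0, 8000.0],
    "median_earnings": [52000.0, 48000.0, 55000.0, 61000.0, 45000.0],
})

# Split the rows by state, average the numeric columns within each group,
# then turn the group key back into an ordinary column
by_state = (
    df.groupby("state")[["tuition_in_state", "median_earnings"]]
      .mean()
      .reset_index()
)
print(by_state)
```

Calling `reset_index()` keeps `state` as a regular column, which is convenient when it is later used as a plot axis.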
Step 5 - Visualize Data
5.1 - Bar Charts
We want to rank the states by their in-state tuition and by their potential earnings (10 years after graduation).
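One way this ranking could look: sort the per-state summary with `sort_values`, then draw a bar chart. The frame and its column names are hypothetical, and the example assumes a plotting library (matplotlib by default) is installed.

```python
import pandas as pd

# Made-up per-state summary
by_state = pd.DataFrame({
    "state": ["CA", "NY", "TX"],
    "tuition_in_state": [10000.0, 14000.0, 8000.0],
    "median_earnings": [50000.0, 58000.0, 45000.0],
})

# Highest tuition first
ranked = by_state.sort_values("tuition_in_state", ascending=False)
print(ranked["state"].tolist())  # ['NY', 'CA', 'TX']

# One bar per state, drawn in ranked order
chart = ranked.plot.bar(
    x="state", y="tuition_in_state", title="In-state tuition by state"
)
```

The same two steps with `median_earnings` as the sort key and `y` value give the earnings ranking.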
5.2 - Scatter Plot
We want to find out if there is a relationship between tuition and potential earnings.
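A scatter plot gives a visual answer, and a correlation coefficient can back it up with a number. The sketch below uses made-up values and hypothetical column names, and it assumes a plotting library (matplotlib by default) is installed.

```python
import pandas as pd

# Made-up sample: one row per state
by_state = pd.DataFrame({
    "state": ["CA", "NY", "TX"],
    "tuition_in_state": [10000.0, 14000.0, 8000.0],
    "median_earnings": [50000.0, 58000.0, 45000.0],
})

# Pearson correlation between the two columns (between -1 and 1)
correlation = by_state["tuition_in_state"].corr(by_state["median_earnings"])
print(round(correlation, 3))

# Each point is one state: tuition on the x-axis, earnings on the y-axis
chart = by_state.plot.scatter(
    x="tuition_in_state", y="median_earnings", title="Tuition vs. earnings"
)
```

A correlation near 1 in this toy sample only reflects the invented numbers; the real data may show a weaker or different pattern.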