🏭 The Engineering Project with Python

Project Description

In this project, I'll pretend I've recently been hired as a data analyst at a manufacturing/engineering/science company. More specifically, I'm a data analyst for a mining company called Metals R' Us, and I've been given data from their flotation plant.

Topics/Skills covered

✅ Variables ✅ Print statements ✅ Mathematical operations ✅ Functions ✅ Loops ✅ IDEs ✅ Libraries ✅ Reading in data w/ Pandas ✅ Descriptive analytics w/ Pandas ✅ Filtering w/ Pandas ✅ Data visualization

💾 Dataset

You can find more information about the dataset and download it using this link: https://www.kaggle.com/datasets/edumagalhaes/quality-prediction-in-a-mining-process?resource=download

1. Import the libraries and the dataset.

# import the libraries needed for this project
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# read the CSV file to work with (the file uses "," as the decimal separator)
df = pd.read_csv('MiningProcess_Flotation_Plant_Database.csv', decimal=",")
df.head()
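The `decimal=","` argument matters because the numbers in this file are written with a comma as the decimal mark. Here is a minimal sketch of the difference it makes, using a tiny in-memory sample. The two rows and their values are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical sample: numbers written with a comma as the decimal mark.
sample_csv = (
    'date,% Iron Feed,% Silica Feed\n'
    '2017-03-10 01:00:00,"55,2","16,98"\n'
    '2017-03-10 02:00:00,"57,1","14,30"\n'
)

# Without decimal=",", the quoted values stay as text.
as_text = pd.read_csv(io.StringIO(sample_csv))

# With decimal=",", pandas parses them as floats.
as_float = pd.read_csv(io.StringIO(sample_csv), decimal=",")

print(as_text.dtypes)
print(as_float.dtypes)
```

If the numeric columns show up as text after loading, a missing `decimal` argument is a likely cause.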

2. Counting the number of rows and columns

df.shape

Our dataset has 737,453 rows and 24 columns (attributes).

3. Indexing the DataFrame (rows & columns)

df['% Iron Concentrate']

For example, if I needed only rows 100-104 and all the columns, I'd use:

df.iloc[100:105,:]
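The slice ends at 105 rather than 104 because `iloc` is position-based and excludes the end of the slice, while `loc` is label-based and includes it. A small sketch on a made-up DataFrame (the `demo` frame and its `value` column are hypothetical):

```python
import pandas as pd

# Hypothetical frame with a default 0..199 integer index.
demo = pd.DataFrame({"value": range(200)})

# Position-based: end of the slice is excluded, so 100:105 gives rows 100-104.
by_position = demo.iloc[100:105, :]

# Label-based: end of the slice is included, so 100:104 gives the same rows.
by_label = demo.loc[100:104, :]

print(by_position.index.tolist())
print(by_position.equals(by_label))
```

The two only match here because the index labels happen to equal the row positions.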

4. Working with Dates

print(type(df))
print(type(df['date']))
print(type(df['date'][0]))

df['date'] = pd.to_datetime(df['date'])

print(type(df['date'][0]))
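To see what the conversion changes, here is a short sketch on a made-up two-row frame (the `demo` frame is hypothetical). Before `pd.to_datetime`, each date is a plain string. Afterwards it is a `Timestamp`, which unlocks the `.dt` accessor.

```python
import pandas as pd

# Hypothetical sample with dates stored as text.
demo = pd.DataFrame({"date": ["2017-03-10 01:00:00", "2017-03-10 02:00:00"]})

before = type(demo["date"][0])
demo["date"] = pd.to_datetime(demo["date"])
after = type(demo["date"][0])

print(before.__name__, "->", after.__name__)

# Date parts are now available through the .dt accessor.
print(demo["date"].dt.hour.tolist())
```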

5. 📊 Descriptive Analytics

My boss has asked me to give some summary statistics for each column.

round(df.describe(),2)

The % Iron Concentrate is the most important variable. But my engineer peer tells me that the % Silica Concentrate, Ore Pulp pH, & Flotation Column 05 Level are all really important as well. My boss says something weird happened on June 1, 2017, & wants me to investigate.

I need to pare the data down to only these columns & the rows that fall on June 1, 2017.

# keep only the rows with a timestamp on June 1, 2017
df_june = df[(df['date'] > "2017-05-31 23:59:59") & (df['date'] < "2017-06-02")].reset_index(drop=True)
df_june

# keep only the columns of interest
important_cols = [
    'date',
    '% Iron Concentrate',
    '% Silica Concentrate',
    'Ore Pulp pH',
    'Flotation Column 05 Level'
]
df_june_important = df_june[important_cols]
df_june_important
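The same one-day filter can also be written as a half-open interval (start included, end excluded), which avoids the `23:59:59` boundary. This is just a sketch on a made-up four-row frame. The `demo` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical readings around the day of interest.
demo = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-05-31 23:00:00",
        "2017-06-01 00:00:00",
        "2017-06-01 23:00:00",
        "2017-06-02 00:00:00",
    ]),
    "% Iron Concentrate": [64.1, 65.0, 66.2, 64.8],
})

start = pd.Timestamp("2017-06-01")
end = pd.Timestamp("2017-06-02")

# Keep rows where start <= date < end, i.e. everything on June 1.
mask = (demo["date"] >= start) & (demo["date"] < end)
one_day = demo.loc[mask].reset_index(drop=True)

print(one_day)
```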

6. Pair Plot (Multiple Scatters) & Correlation

sns.pairplot(df_june_important)

I want to see if there are any relations among the important variables. The graph above doesn't seem to show any important correlation among the variables.

This can be confirmed by computing a correlation matrix & checking that all the correlation values are low.

round(df_june_important.corr(), 2)
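For reading the matrix: values near 1 or -1 mean a strong linear relationship, and values near 0 mean a weak one. Below is a small sketch on made-up data. The columns `a`, `b`, `c` are hypothetical, and dropping the `date` column first is my own choice here, so that only the measurements are compared.

```python
import pandas as pd

# Hypothetical data: b rises with a, c falls as a rises.
demo = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-06-01 00:00", "2017-06-01 01:00",
        "2017-06-01 02:00", "2017-06-01 03:00",
    ]),
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})

# Drop the timestamp so the matrix only contains the measurements.
corr = demo.drop(columns="date").corr().round(2)

print(corr)
```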

7. 📈 Line Charts

My boss is a bit confused & wants to see the data to help him understand more. He wants to see how the % Iron Concentrate changes throughout that day. In this case, I will use a line chart to visualize the data.

sns.lineplot(x = "date", y = '% Iron Concentrate', data = df_june)

# draw one line chart per important variable (skipping the 'date' column)
for col in important_cols[1:]:
    sns.lineplot(x='date', y=col, data=df_june)
    plt.show()
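One thing to keep in mind when reading these charts: when several rows share the same x value, `sns.lineplot` by default draws the mean at that x value with a confidence band around it. Assuming the timestamps repeat across rows, the line is roughly the per-timestamp mean, which can be sketched with pandas alone. The `demo` frame and its values are made up.

```python
import pandas as pd

# Hypothetical sample: three readings share each timestamp.
demo = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-06-01 00:00"] * 3 + ["2017-06-01 01:00"] * 3
    ),
    "% Iron Concentrate": [64.0, 65.0, 66.0, 65.0, 66.0, 67.0],
})

# Mean reading per timestamp, i.e. the value the line passes through.
mean_per_timestamp = demo.groupby("date")["% Iron Concentrate"].mean()

print(mean_per_timestamp)
```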
