import pandas as pd
df = pd.read_csv('cars.csv')
df
Next, let's look at the mean price of the cars, using the price_usd attribute (a numerical variable):
And the median of the same variable:
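The two steps above compute the mean and the median of price_usd. A minimal sketch of that computation follows, using a small made-up series in place of the real column (the values are hypothetical and not taken from cars.csv):

```python
import pandas as pd

# Hypothetical sample standing in for df['price_usd']; the values are made up.
prices = pd.Series([1500, 3200, 4800, 5000, 7500, 9900, 48000], name='price_usd')

# The mean is sensitive to extreme values; the median is not.
mean_price = prices.mean()
median_price = prices.median()

print(f"mean: {mean_price:.2f}, median: {median_price:.2f}")
```

On the real data the equivalent calls would be df['price_usd'].mean() and df['price_usd'].median(). A mean well above the median is a first hint that a few expensive cars are pulling the average up.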
Next, we can plot a histogram to get a general picture of the actual car prices. It shows whether there is bias (skew) in the previous measures, and indeed there is: most of the values fall between $0 and $10,000. There are also some extreme values that may be causing the bias.
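One way to back the visual impression with numbers is to measure what share of prices falls in the $0 to $10,000 range and how skewed the distribution is. The sketch below uses a hypothetical sample, not the real dataset:

```python
import pandas as pd

# Hypothetical prices (made up for illustration, not from cars.csv).
prices = pd.Series([1500, 3200, 4800, 5000, 7500, 9900, 48000], name='price_usd')

# Fraction of cars priced between 0 and 10,000 USD (both bounds inclusive).
share_under_10k = prices.between(0, 10_000).mean()

# Positive skewness indicates a long right tail of expensive cars.
skewness = prices.skew()

print(f"share in $0-10,000: {share_under_10k:.2f}, skewness: {skewness:.2f}")
```

Applied to df['price_usd'], the same two lines would quantify what the histogram shows.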
Because of this bias, let's analyze the prices per brand:
*Note: with seaborn we can build more interesting plots: https://seaborn.pydata.org/tutorial/distributions.html
import seaborn as sns
sns.displot(df, x='price_usd', hue='manufacturer_name')
Choose the type of visualization carefully and keep it as simple as possible; one hue per manufacturer gives too many categories to read. Let's try another variable that has fewer categories.
sns.displot(df, x='price_usd', hue='engine_type', multiple='stack')
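Before picking a column for hue, it can help to check how many distinct values each candidate has. A small sketch, assuming a hypothetical sample frame whose column names mirror the dataset (the rows are made up):

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the dataset used above.
sample = pd.DataFrame({
    'manufacturer_name': ['Audi', 'Kia', 'Ford', 'Audi', 'Toyota', 'Kia'],
    'engine_type': ['gasoline', 'gasoline', 'diesel', 'diesel', 'gasoline', 'electric'],
})

# Number of distinct values in each candidate hue column.
n_categories = sample[['manufacturer_name', 'engine_type']].nunique()

print(n_categories)
```

A column with only a handful of categories, such as engine_type here, usually produces a more readable stacked histogram.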
Let's check on the electric cars. The following table shows that there are only 10!
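One way to produce such a table is to count the rows per engine type. The sketch below is an assumption about how the count could be obtained, shown on a hypothetical sample rather than on cars.csv:

```python
import pandas as pd

# Hypothetical rows (made up); the real data would come from cars.csv.
sample = pd.DataFrame({
    'engine_type': ['gasoline', 'gasoline', 'diesel', 'diesel', 'gasoline', 'electric'],
    'price_usd': [4800, 7500, 9900, 5000, 3200, 48000],
})

# Number of cars for each engine type, most frequent first.
counts = sample['engine_type'].value_counts()

print(counts)
```

On the real data, df['engine_type'].value_counts() would give the number of electric cars directly.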
Now, let's dig deeper and focus on a particular brand.
First, let's filter the data for a particular case: the 'Audi' brand and the 'Q7' model.
q7_df = df[(df['manufacturer_name'] == 'Audi') & (df['model_name'] == 'Q7')]
q7_df
Now, let's plot a histogram
sns.displot(q7_df, x= 'price_usd', hue = 'year_produced')
Next, let's filter the data for the Kia brand and the Picanto model.
kia_df = df[(df['manufacturer_name'] == 'Kia') & (df['model_name'] == 'Picanto')]
kia_df
Now, let's plot the histogram. We can see that Picanto cars are not expensive, and that this dataset contains only a few cars of this model.
sns.displot(kia_df, x = 'price_usd', hue='year_produced')
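The Audi Q7 and Kia Picanto filters follow the same pattern, so they could be wrapped in a small helper. The function name filter_model and the sample rows below are hypothetical, introduced only for illustration:

```python
import pandas as pd

def filter_model(cars, manufacturer, model):
    """Return the rows of `cars` matching one manufacturer and one model."""
    mask = (cars['manufacturer_name'] == manufacturer) & (cars['model_name'] == model)
    return cars[mask]

# Hypothetical rows (made up); only the column names mirror the dataset.
sample = pd.DataFrame({
    'manufacturer_name': ['Audi', 'Audi', 'Kia', 'Kia'],
    'model_name': ['Q7', 'A4', 'Picanto', 'Rio'],
    'year_produced': [2010, 2008, 2012, 2015],
    'price_usd': [20000, 9000, 5500, 8000],
})

q7_sample = filter_model(sample, 'Audi', 'Q7')
picanto_sample = filter_model(sample, 'Kia', 'Picanto')

print(q7_sample)
print(picanto_sample)
```

With the real data, filter_model(df, 'Audi', 'Q7') would return the same rows as q7_df above, and the result can be passed to sns.displot in the same way.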