Exploring Ebay Car Sales Data
Used cars websites tend to offer the best value-to-price ratios for those who wish to purchase cars. In this project, we'll be analyzing an Ebay used cars dataset to understand which cars are the most valuable for value- and price-conscious buyers.
To make our recommendation, we'll try to uncover a value-to-price metric we can use to rank car models
When considering our price-to-value metric, the car brand Sonstige Autos have the lowest mean price with the highest mean value.
dateCrawled: When this ad was first crawled
name: Name of the car
seller: Whether the seller is private or a dealer
offerType: The type of listing
price: The price on the ad to sell the car
abtest: Whether the listing is included in an A/B test
vehicleType: The vehicle type
yearOfRegistration: The year in which the car was first registered
gearbox: The transmission type
powerPS: The power of the car in PS
model: The car model name
kilometer: How many kilometers the car has driven
monthOfRegistration: The month in which the car was first registered
fuelType: What type of fuel the car uses
brand: The brand of the car
notRepairedDamage: If the car has a damage which is not yet repaired
dateCreated: The date on which the eBay listing was created
nrOfPictures: The number of pictures in the ad
postalCode: The postal code for the location of the vehicle
lastSeenOnline: When the crawler saw this ad last online
We can observe that columns such as
notRepairedDamage are in German, while this analysis will be directed towards an English-speaking audience. We can also note that
monthOfRegistration is captured in integer values.
Most of the columns contain
string data, with the remainder being
Additionally, columns such as
lastSeen are saved as type
object instead of
notRepairedDamage have a high number of null values, though none have more than 20% of the dataset.
Cleaning column names
Let's convert the column names from camelcase to Python's preferred snakecase and rename columns to be more descriptive.
Converting column types
We previously identified
object types that should instead be saved as dates. Let's update them
Let's investigate which
string columns have the lowest number of unique values.
Of the string columns, only the
offer_type columns contained information that was overwhelmingly one-sided:
- From the
sellercolumn, privat translates to private, while gewerblick translates to commercial. This suggests that most people were private sellers.
- From the
offer_typecolumn, angebot translates to offer, while gesuch translates to request. This suggests that most people were offering their car for sale.
Because these columns are one-sided, they don't add useful information to our analysis. Let's drop these columns.
Let's investigate the
We can observe with
num_of_pictures that none of the listings contain any pictures, so we can drop that column too.
Because we'll likely be using the
price column to determinate a value-to-price metric, we should remove any outlier information that might be contained in these columns
We can observe the following:
pricecolumn contains a maximum value of 21,474,83,647. This is far removed from the median of 2,950. It also contains a minimum value of 0 which could impact our analysis
- The median and 75th percentile value in the
kilometercolumn share the same amount as the max value. This indicates that there is an upper bound of 150,000 kilometers that the majority of cars hit.
Inspecting the top 10 values shows that the main outlier has a value of 21,474,83,647, while the other outliers appear to hit a limit of 99,999,999.
Cars that are two expensive will negatively impact our value-to-price metric. Let's drop rows that have a value far removed from the 75th percentile.
Inspecting the 5000th element indicates a value of 32,600. This is a better approximation of the max for the purposes of our analysis given our 75th percentile value of 7,200.
Let's also investigate listings which had a price of 0.
We can see that there are 10,778 rows with a price equal to
0. This will also affect our value-to-price ratio so we let's also filter these out.
We can observe the following:
- There are 212,800 cars which have a
kilometervalue of 150,000. This validates that there is indeed an upper limit for kilometers.
- None of these cars have a
kilometervalue of 0. This makes sense given that this is a used cars listing.
We won't make any changes to the
We can see that:
registration_yearhas values as low as 1000 and as high as 9,999.
registration_monthspans from 0-12. Since the lowest value is at 0 (presumably denoting January), we would expect the highest value to be at 11 (denoting December).
Let's investigate these further
Let's remove those values with years greater than 2016 (the time which the ads were created) and years before the 1900.
registration_month is more problematic as the potential values for January (Month 0 and Month 12) both have 20,000 rows.
It's unclear which of these is incorrect, but deleting either removes a significant number of rows. We'll keep this in mind as we conduct the rest of our analysis.
Creating a Value Metric
We'll be creating a value-to-price metric, primarily based on the price of the car and the kilometers travelled.
Initial value metric
We can observe that the car that costs 10,000 but has driven 50,000 kilometers has a higher value metric than the car that has cost 10,000 but has driven 150,000 kilometers.
Naturally, a car that costs
1,000 but has only driven
50 kilometers would have a value as high as
20.This is why we set limits for
kilometer during our data cleaning. Here's a reminder:
This greatly limits the range that would be seen in our newly-created
Halve value if car is unrepaired
If a car has unrepaired damage, we'll halve its value
Scale value metric between 0 and 1
Our value metric is currently between the followings numbers
Let's scale these to be between
1 for easier analysis.
Analyzing Price and Value by Brand
We'll now investigate the mean
value for a given brand.
Brands by Price
As expected, we can observe that a lot of the high-name brands have a higher mean
Brands by Value
When considering brands by our
value metric, however, we can observe that sonstige autos ranks the highest.
Recommendations for High-Value, Used-Car Brands
From our analysis, it's clear that sonstige_autos, mini and porsche rank highest when considering our newly-created
Of these 3 brands, the mean
price of a Sonstige Autos car is the lowest, thus we recommend that value- and price-conscious buyers purchase Sonstige Autos brands.
In this project, we analyzed Ebay used cars data to understand which cars are most valuable for value- and price-conscious buyers.
The conclusion we arrived at was that Sonstige Autos brands were most valuable as they ranked highly in our
value metric and had the lowest mean