Data Preprocessing Project Unit 1

After working through every step needed to read and preprocess a dataset, it is time to apply all that knowledge.

Exploratory Data Analysis: Game reviews from users on Steam

Steam is the world's most popular PC gaming hub, with over 6,000 games and a community of millions of gamers. With a massive collection that includes everything from AAA blockbusters to small indie titles, great discovery tools are a highly valuable asset for Steam. How can we make them better? By analyzing the reviews of their games.

First we are going to analyze the data set in order to understand it.

## Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as col
import seaborn as sns

## Dataset Import
df_sales = pd.read_csv('/work/steam_reviews.csv')

df_sales.head()

Here we describe the data set and check whether any column contains null values.

df_sales.describe()

df_sales.isnull().sum()
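As a minimal sketch of how to read the null check, the example below runs it on a small made-up table (the rows are invented for illustration and are not taken from steam_reviews.csv). Besides the raw counts, it also shows the share of nulls per column, which can be easier to compare across columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the review data; the values are made up for illustration.
toy = pd.DataFrame({
    "title": ["Game A", "Game A", "Game B", "Game B"],
    "review": ["Great fun", np.nan, "Crashes a lot", np.nan],
    "recommendation": ["Recommended", "Recommended",
                       "Not Recommended", "Recommended"],
})

# Number of nulls per column, the same call used on df_sales.
null_counts = toy.isnull().sum()

# Share of nulls per column: the mean of a boolean mask is the
# fraction of True values.
null_share = toy.isnull().mean()

print(null_counts)
print(null_share)
```

A column with a high null share may need a different treatment than one with only a handful of missing values.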

Data cleaning

In the review section, we can see that the only problem with this dataset is the null values in the review column. Let's do something about it.

df_sales = df_sales.fillna(' ')

df_sales.isnull().sum()
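Calling fillna(' ') on the whole DataFrame works here because only the review column has nulls. As a hedged alternative, the sketch below fills just the text column, so that gaps in any numeric column would not be replaced by a string. The toy rows and the `hours` column are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Toy data; `hours` is a hypothetical numeric column.
toy = pd.DataFrame({
    "title": ["Game A", "Game B", "Game C"],
    "review": ["Great fun", np.nan, "Too short"],
    "hours": [12.0, np.nan, 3.5],
})

# Fill only the text column; numeric gaps stay as NaN instead of
# being turned into the string ' '.
cleaned = toy.copy()
cleaned["review"] = cleaned["review"].fillna(" ")

print(cleaned.isnull().sum())
print(cleaned.dtypes)
```

Keeping numeric columns numeric means later calls such as describe() still treat them as numbers.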

Let's tidy this dataset

sales_tidy = df_sales.groupby(['title', 'recommendation']).size()
sales_tidy = sales_tidy.reset_index().rename(columns={"title": "Game", "recommendation": "Recommendation", 0: "Count"})
sales_tidy.head(74)
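To make the shape of the tidy table concrete, here is a small sketch on invented rows. It repeats the same grouping and then pivots the counts into one row per game, which is one possible way to compare games side by side. The "Share recommended" column is an addition for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy data; the rows are made up for illustration.
toy = pd.DataFrame({
    "title": ["Game A", "Game A", "Game A", "Game B", "Game B"],
    "recommendation": ["Recommended", "Recommended", "Not Recommended",
                       "Not Recommended", "Recommended"],
})

# Same grouping as above: one row per (game, recommendation) pair.
tidy = (
    toy.groupby(["title", "recommendation"])
    .size()
    .reset_index(name="Count")
    .rename(columns={"title": "Game", "recommendation": "Recommendation"})
)

# Wide view: one row per game, one column per recommendation value.
wide = tidy.pivot(index="Game", columns="Recommendation", values="Count").fillna(0)

# Fraction of each game's reviews that are positive.
wide["Share recommended"] = wide["Recommended"] / wide.sum(axis=1)

print(tidy)
print(wide)
```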

!pip install pandas-profiling==2.7.1
from pandas_profiling import ProfileReport

profile = ProfileReport(sales_tidy)
profile

Conclusions

After these steps, we can conclude that users tend to judge games while they are in early access and, even after the games are released, they do not update their comments (based on the review date and the release date of each game). This makes reporting difficult and leaves too little information to pass on to the developers. Nevertheless, it is important to understand what the community wants, to identify which games have the biggest problems, and to try to improve the user experience of each game.
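One way to check the early access observation is to measure, per game, what fraction of reviews were written during early access. The sketch below assumes a boolean column named `is_early_access_review`. That name is an assumption, not something shown in the cells above, and the rows are invented, so adjust it to whatever the data set actually provides.

```python
import pandas as pd

# Toy data; `is_early_access_review` is an assumed column name.
toy = pd.DataFrame({
    "title": ["Game A", "Game A", "Game A", "Game B", "Game B"],
    "is_early_access_review": [True, True, False, False, False],
})

# The mean of a boolean column is the fraction of True values, i.e. the
# share of reviews written while the game was in early access.
early_share = (
    toy.groupby("title")["is_early_access_review"]
    .mean()
    .sort_values(ascending=False)
)

print(early_share)
```

Games at the top of this ranking are the ones whose reviews mostly predate release.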

Project Part II: Visualization

After eliminating the null data, it is time to bring some order to the data set and make the information understandable.

df_sales['review_length'] = df_sales.apply(lambda row: len(str(row['review'])), axis=1)
df_sales['recommendation_int'] = df_sales['recommendation'] == 'Recommended'
df_sales['recommendation_int'] = df_sales['recommendation_int'].astype(int)
len(df_sales['title'].unique()), df_sales['title'].unique()
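The two derived columns can also be built without a row-wise apply, which tends to be slow on large tables. The following is a minimal sketch on invented rows, using vectorized string and comparison operations. The per-game summary at the end is an illustrative addition showing how the new columns can be used.

```python
import pandas as pd

# Toy data; the rows are made up for illustration.
toy = pd.DataFrame({
    "title": ["Game A", "Game A", "Game B"],
    "review": ["Great fun", " ", "Crashes"],
    "recommendation": ["Recommended", "Recommended", "Not Recommended"],
})

# Vectorized equivalents of the row-wise apply above.
toy["review_length"] = toy["review"].astype(str).str.len()
toy["recommendation_int"] = (toy["recommendation"] == "Recommended").astype(int)

# Per-game summary: review count, mean review length, and the share
# of positive reviews.
summary = toy.groupby("title").agg(
    reviews=("review", "count"),
    mean_length=("review_length", "mean"),
    recommended_share=("recommendation_int", "mean"),
)

print(summary)
```

Because `recommendation_int` is 0 or 1, its mean per game is the fraction of positive reviews.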

This data set only has 48 of the 6k+ games on the platform. Nevertheless, it is important to see which games have the most reviews.

# Total number of reviews per game, most reviewed first
reviews_count = df_sales.groupby(['title'])['review'].count().sort_values(ascending=False)
reviews_count = reviews_count.reset_index()
sns.set(style="darkgrid")
plt.figure(figsize=(25,20))
sns.barplot(y=reviews_count['title'], x=reviews_count['review'], data=reviews_count, label="Total", color="r")
# Overlay the number of positive reviews per game, keeping the same game order as the first plot
reviews_count_pos = df_sales.groupby(['title', 'recommendation_int'])['review'].count().sort_values(ascending=False)
reviews_count_pos = reviews_count_pos.reset_index()
reviews_count_pos = reviews_count_pos[reviews_count_pos['recommendation_int'] == 1]
sns.barplot(y=reviews_count_pos['title'], x=reviews_count_pos['review'], data=reviews_count_pos, order=reviews_count['title'].tolist(), label="Recommended", color="b")

# Count reviews per recommendation value; take the labels from the same Series so sizes and labels always match
rec_counts = df_sales['recommendation'].value_counts()
sizes = rec_counts.values
labels = rec_counts.index
explode = (0, 0.1)
fig1, ax1 = plt.subplots()
ax1.set_title('Games recommendation')
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')
plt.tight_layout()
plt.show()

pro = ProfileReport(df_sales)
pro