# Coursework 2 - C11BD

Leticia Hurtado de Mendoza - H00434517

18/03/2023

# Introduction

This report, based on the 'Superstore.csv' dataset, analyses the company's data to determine how it can increase its profits. The document starts with an initial data import, followed by data cleaning to identify possible errors and outliers. It then analyses the summary statistics obtained, visualises the data through plots, and finishes with a linear regression analysis.

# Import the Data

In this first step, we import all the packages that will be useful as the analysis of the data progresses.

The pandas package must be imported before the dataset can be loaded. A table then displays the full 'Superstore.csv' dataset.
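A minimal sketch of the import step is shown here. A small inline sample stands in for 'Superstore.csv' so that the snippet runs on its own; the 'Region' and 'Discount' column names and all the values are assumptions made for illustration, not the real data.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for 'Superstore.csv'.
# With the real file, this would be: df = pd.read_csv("Superstore.csv")
sample_csv = io.StringIO(
    "Row ID,Segment,Region,Sales,Quantity,Discount,Profit\n"
    "1,Consumer,West,41.28,2,0.0,8.80\n"
    "2,Corporate,East,12.50,1,0.2,-1.75\n"
    "3,Home Office,South,99.90,5,0.1,20.10\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first rows, as a quick sanity check
```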

# Re-labelling Columns

The next step is to re-label the columns to get a cleaner view of the data. The 'pd.unique()' method is then used to show more specific information about the different categories. It is particularly useful for large datasets because it identifies the unique elements within the data, which supports later exploration, cleaning and analysis tasks.
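The two operations can be sketched as follows. The data frame is a made-up example and the new label 'Customer Segment' is purely hypothetical; the labels actually chosen in the notebook may differ.

```python
import pandas as pd

# Made-up sample; values are illustrative only.
df = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Consumer", "Home Office"],
    "Region": ["West", "East", "West", "South"],
})

# Re-label a column (the new name is a hypothetical example).
df = df.rename(columns={"Segment": "Customer Segment"})

# pd.unique returns the distinct values in order of first appearance.
segments = pd.unique(df["Customer Segment"])
regions = pd.unique(df["Region"])
print(segments)
print(regions)
```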

# Cleaning the Data

The 'Cleaning the Data' step involves removing data entry errors and outliers. Different techniques can be applied at this stage. In this case, 'isnull()' is used to find missing values, 'describe()' to spot outliers, and 'dtypes' to detect data entry errors. Lastly, a visual inspection is carried out to identify any remaining anomalies.
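A short sketch of these three checks on a made-up data frame follows. The missing value and the text-typed 'Profit' column are planted deliberately, and the 'pd.to_numeric' repair at the end is one possible fix, assumed here for illustration rather than taken from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value and one column stored as text.
df = pd.DataFrame({
    "Sales": [41.28, 12.50, np.nan, 99.90],
    "Quantity": [2, 1, 3, 5],
    "Profit": ["8.80", "-1.75", "4.10", "20.10"],  # numbers entered as strings
})

missing_per_column = df.isnull().sum()  # count of missing values per column
print(missing_per_column)
print(df.dtypes)      # a non-numeric 'Profit' hints at a data entry problem
print(df.describe())  # min/max far from the quartiles hint at outliers

# One possible repair: coerce text to numbers; unparseable entries become NaN.
df["Profit"] = pd.to_numeric(df["Profit"], errors="coerce")
```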

Now that all possible data entry errors have been removed, the next step is to identify and handle outliers.

After identifying and handling outliers in the relevant columns of our dataset, we need to verify that they have been removed correctly. This is done using the IQR method (see below).
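As a sketch of the IQR method, the snippet below flags values lying more than 1.5 times the interquartile range beyond the quartiles. The helper name 'iqr_bounds', the multiplier of 1.5 (the common convention) and the sample values are assumptions; the notebook may use different settings.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up 'Profit' values with one obvious outlier (250.0).
profit = pd.Series([5.0, 7.0, 8.0, 9.0, 10.0, 12.0, 250.0])

low, high = iqr_bounds(profit)
cleaned = profit[(profit >= low) & (profit <= high)]

# Verification: nothing in the cleaned data lies outside the fences.
remaining = cleaned[(cleaned < low) | (cleaned > high)]
print(f"fences: [{low}, {high}], rows kept: {len(cleaned)}, outliers left: {len(remaining)}")
```

The verification reuses the fences computed before removal. Recomputing them on the cleaned data gives a narrower range and may flag further points, so it is worth deciding in advance which of the two checks is intended.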

# Summary Statistics

Summary statistics are useful because they provide a concise overview of the dataset's characteristics. They offer a quick view of the data's distribution, central tendency, dispersion and shape. The first thing to do is to print the statistical values for all columns.

Summary statistics for the relevant columns show the dataset's central tendency and spread. The mean 'Profit' stands at $8.80, with a standard deviation of $8.25. 'Sales' displays considerable variability, with a mean of $41.28 and a standard deviation of $42.36. Most discounts hover around 9%, although some orders receive none and a few receive exceptionally high discounts of up to 50%.
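Figures of this kind can be obtained with 'describe()' restricted to the columns of interest, as in this sketch. The numbers are made up and will not reproduce the values quoted above; 'Discount' is an assumed column name.

```python
import pandas as pd

# Made-up sample; the real dataset will give different figures.
df = pd.DataFrame({
    "Sales": [10.0, 20.0, 30.0, 40.0],
    "Profit": [1.0, 2.0, 3.0, 6.0],
    "Discount": [0.0, 0.0, 0.2, 0.5],
})

relevant = ["Sales", "Profit", "Discount"]
summary = df[relevant].describe()  # count, mean, std, min, quartiles, max

# Central tendency and spread for the relevant columns only.
print(summary.loc[["mean", "std", "min", "max"]])

# Skewness gives a rough indication of the shape of each distribution.
shape = df[relevant].skew()
print(shape)
```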

# Plotting

Plotting helps us to visualise the data and understand it more easily. To make plotting possible, some packages, such as 'matplotlib', were imported at the beginning of this notebook.

This graph compares Sales against Profit; negative Profit values represent a loss.
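A comparison like this could be drawn with a 'matplotlib' scatter plot, sketched below on made-up values. Colouring the loss-making orders separately is an illustrative choice and not necessarily how the original graph was produced.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; two of the five orders make a loss.
df = pd.DataFrame({
    "Sales": [10.0, 25.0, 40.0, 60.0, 80.0],
    "Profit": [2.0, -3.0, 8.0, -5.0, 15.0],
})

loss = df["Profit"] < 0  # boolean mask marking loss-making orders

fig, ax = plt.subplots()
ax.scatter(df.loc[~loss, "Sales"], df.loc[~loss, "Profit"], label="Profit")
ax.scatter(df.loc[loss, "Sales"], df.loc[loss, "Profit"], color="red", label="Loss")
ax.axhline(0, color="grey", linewidth=1)  # break-even line
ax.set_xlabel("Sales ($)")
ax.set_ylabel("Profit ($)")
ax.set_title("Sales v. Profit")
ax.legend()
# In a notebook the default inline backend renders the figure;
# in a script, fig.savefig("sales_vs_profit.png") writes it to disk.
```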

The above graph shows who buys and how much, which can help the business decide where to focus in order to increase its profitability.
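The figures behind a "who buys and how much" view can be computed with a group-by, as sketched here. Grouping total 'Sales' by 'Segment' is an assumption about what the graph displays, and the values are made up.

```python
import pandas as pd

# Made-up sample; values are illustrative only.
df = pd.DataFrame({
    "Segment": ["Consumer", "Consumer", "Corporate", "Home Office"],
    "Sales": [10.0, 20.0, 15.0, 5.0],
})

# Total sales per customer segment, largest first.
sales_by_segment = (
    df.groupby("Segment")["Sales"].sum().sort_values(ascending=False)
)
print(sales_by_segment)
```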

This last graph shows the number of orders per region. Most orders come from the West and East regions, which can inform decisions about different selling strategies for the company.
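Order counts per region can be computed with 'value_counts()' and drawn as a bar chart, as in this sketch. It assumes a column named 'Region' and treats each row as one order; both are assumptions, and the values are made up.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; each row is treated as one order (an assumption).
df = pd.DataFrame({
    "Region": ["West", "East", "West", "South", "East", "West", "Central"],
})

# value_counts sorts from the most to the least frequent region.
orders_per_region = df["Region"].value_counts()
print(orders_per_region)

fig, ax = plt.subplots()
ax.bar(list(orders_per_region.index), orders_per_region.to_numpy())
ax.set_xlabel("Region")
ax.set_ylabel("Number of orders")
ax.set_title("Orders per region")
```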

# Modelling strategy - Linear Regression

Linear regression was chosen as the modelling strategy to identify which features contribute most to profit. The outcome shows that these are 'Profit', 'Row ID', 'Quantity', 'Sub-Category Number' and 'Segment'.

Linear regression is especially useful when the relationship between the independent variables and the dependent variable is approximately linear. It can be used for predictive modelling, as well as for trend and correlation analysis.
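To illustrate the idea, the sketch below fits an ordinary least squares model with 'numpy.linalg.lstsq'. This is a stand-in technique chosen to keep the example self-contained; the notebook may have used a different library. The data are synthetic and built from a known linear rule, so the fit should recover the coefficients 1, 2 and -10.

```python
import numpy as np
import pandas as pd

# Synthetic data built from a known rule: Profit = 1 + 2*Quantity - 10*Discount.
df = pd.DataFrame({
    "Quantity": [1, 2, 3, 4, 5, 6],
    "Discount": [0.0, 0.1, 0.0, 0.2, 0.5, 0.1],
})
df["Profit"] = 1.0 + 2.0 * df["Quantity"] - 10.0 * df["Discount"]

features = ["Quantity", "Discount"]  # predictors only; the target is excluded
X = df[features].to_numpy(dtype=float)
X = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for the intercept
y = df["Profit"].to_numpy(dtype=float)

# Least squares solution of X @ coef = y.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept = coef[0]
slopes = dict(zip(features, coef[1:]))
print(intercept, slopes)
```

The target column is left out of the predictors, since including it would make the fit trivially perfect. Raw coefficients also depend on each feature's units, so features are usually standardised before their coefficients are compared to rank their contribution.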

Based on the analysis of the 'Superstore.csv' dataset, the company can adopt the following strategies to increase its profitability:

- Strategies based on segments: the analysis reveals that the majority of orders come from the Consumer segment. To increase profits, the company could develop targeted marketing and sales strategies for the Corporate and Home Office segments. This can involve offering tailored promotions, product bundles or loyalty programmes to attract these segments.
- Regional focus: the distribution of orders by region shows that the East and West regions generate the most orders. While maintaining its efforts there, the business should consider investing resources in the regions where fewer orders are placed. This can involve market research to understand local preferences, adjusting pricing strategies, or increasing advertising and promotion efforts in those regions.
- Product mix optimisation: analysing the sales data by product category and sub-category reveals which products drive profitability and which ones underperform. The company should optimise its product mix by promoting items with high profit margins or introducing new product categories with higher demand.
- Cost-reduction strategies: these can involve negotiating better supplier contracts, reducing waste in the supply chain or implementing technology solutions to streamline processes.
- Outlier management: outliers should be monitored and managed continuously, especially in the 'Profit' column. By removing outliers or mitigating their impact, for example by adjusting pricing or refining inventory management practices, the company can achieve more consistent and sustainable profitability.
- Prediction: the linear regression model identifies the key factors that contribute to profitability. By understanding which variables have the most significant impact on profit, the company can prioritise resources and investments accordingly.