Coursework 2
Aleena Shaju
H00446410
18/03/2024
INTRODUCTION
Python is extensively utilized for company data analysis because of its adaptability and robust libraries tailored for data manipulation, examination, and visualization. Here, Python is employed to scrutinize company data with the aim of enhancing profitability through data analytics. The process begins with data collection, where the company's data is imported into Python, either via libraries like Pandas or through direct database connections. Subsequently, the data undergoes cleaning and preprocessing to ensure its quality and reliability, involving tasks such as handling missing values, eliminating duplicates, and standardizing formats. Python's Pandas library offers efficient tools for this task, facilitating the preparation of data for analysis. Once the data is cleaned and preprocessed, exploratory data analysis (EDA) is conducted. EDA entails visually and statistically exploring the data to unveil patterns, trends, and insights that can guide strategic decision-making. Visualization libraries like Matplotlib, Seaborn, or Plotly are utilized to create visual representations that aid in understanding key metrics and relationships within the data. Customer segmentation emerges as a pivotal application of data analytics in boosting profitability. By employing the KMeans clustering algorithm on the cleaned dataset, distinct clusters of data points representing varying levels of profitability are identified. Utilizing KMeans clustering on factors such as quantity, sales, and discount enables businesses to extract valuable insights, potentially leading to profit improvement.
1. DATA IMPORTING
This process involves loading external datasets into the Python environment.
Run to view results
Run to view results
Run to view results
GENERAL DATA INFO
i) SUMMARY INFORMATION
Run to view results
ii) DESCRIPTIVE STATISTICS: to obtain understanding of the data's variability, central tendency, and distribution in numerical columns
Run to view results
2. CLEANING THE DATA
In order to make sure the data is correct, comprehensive, consistent, and trustworthy for analysis, a number of processes and techniques are used in data cleaning.
Before cleaning the datasets, the different columns in the data frame is renamed as in the coursework instruction.
Run to view results
i) Removing unnecessary columns
Run to view results
ii) Removing Duplicates
Run to view results
iii) Handling Missing Values: Dealing with Null entries
Run to view results
In this dataset, there are no missing values in any of the columns, as indicated by zeros across all columns. The absence of missing values suggests that the dataset is complete in terms of data entry and does not contain any null values.
iv) Standardizing formats
For ensuring that date columns in the dataframe ds1 are stored as datetime objects, which allows for easier manipulation, comparison, and analysis of dates and times thus ensuring consistency in data formats across the dataset.
Run to view results
v) Handling Outliers
Since outliers can have a substantial impact on statistical measures and model performance, identifying them in a dataset is an essential step in data analysis. To find data points that significantly depart from the central tendency, we are looking at the summary statistics for each numerical column, such as the mean, median, standard deviation, and percentiles.
Run to view results
Interquartile Range - IQR method
Calculating the IQR (the range between the 75th and 25th percentiles) can help identify outliers based on their deviation from the median of the dataset. Data points that fall outside a certain range (here 1.5 times the IQR) may be considered outliers
Run to view results
Removing Outliers
Run to view results
Cleaned dataset
Run to view results
Basic insights from cleaned dataset
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
3. EXPLORATORY DATA ANALYSIS
A. Exploring data statistically
i) NUMERICAL VARIABLES
Before determining the summary statistics of the cleaned data, the boolean values in the "returned" column is converted into integers.
Run to view results
Summary statistics of 'sales', 'quantity', 'discount', 'profit', 'returned'
Run to view results
To measure the asymmetry in the distribution of data points around the mean, we use the 'skewness' parameter. The distribution of sales is right-skewed, as indicated by the higher mean compared to the median (50th percentile). The distribution of quantity appears to be relatively symmetric, with a median (50th percentile) equal to the mean. Approximately 8.10% of transactions involve returned items, as indicated by the mean value.
Summary statistics of individual category
Run to view results
Furniture has the highest mean sales amount per transaction. However, Furniture also exhibits the highest variability in sales, as indicated by the highest standard deviation. The mean quantity per transaction is relatively similar across all categories, ranging from approximately 3.13 to 3.52 items. The mean quantity per transaction is relatively similar across all categories, ranging from approximately 3.13 to 3.52 items. There is minimal variability in quantity within each category, as indicated by the comparable standard deviations. Furniture has the highest mean profit per transaction. The variability in profit margins is evident, with Furniture and Technology exhibiting higher standard deviations compared to Office Supplies.
ii) CATEGORICAL VARIABLES
Categorical variables are non-numeric data types that represent categories or labels.
Run to view results
There are a total of 7,140 orders in the dataset. The dataset contains 4,184 unique order IDs, indicating that some orders may have multiple items. The majority of customers belong to the "Consumer" segment. All orders in the dataset are from the United States. "New York City" is the most frequent city, appearing 709 times, and "California" is the most frequent state, appearing 1,606 times.
iii) DATE VARIABLES
Run to view results
B. Exploring Data Visually
i) PLOTTING BAR CHART - CATEGORICAL
Run to view results
The above figure is a stacked bar chart showing the distribution of orders across different product categories for each region. Each bar represents a product category, and each segment within the bar represents the number of orders in a specific region. Office supplies is the most popular product category across all regions and Technology is the least popular category. There seems to be some regional variation in demand. For example, orders for Furniture appear to be higher in the east and west regions compared to the central and south regions. This could be due to factors such as the demographics of the population in each region
Run to view results
Office supplies is the highest-selling product category, with total sales exceeding 700,000. This suggests that office supplies are a major driver of sales for the company
Run to view results
Envelopes, labels, and storage containers appear to be top subcategories within office supplies based on their profit slice. Appliances and machines appear to have among the lowest profit margins based on the graph. Some subcategories, like 'Office Supplies - Storage,' show very high sales but comparatively lower profit, indicating that this subcategory might have lower profit margins. Across most subcategories, sales figures are significantly higher than profit figures. This suggests that while revenue from sales is strong, the profit margins vary greatly, possibly due to different cost structures or pricing strategies for each subcategory. 'Technology - Phones' stands out as the subcategory with the highest profit as well as high sales, indicating it is a strong performer.
Plotting- Scatter chart (Continuous)
Run to view results
A scatter plot showing the relationship between 'sales' and 'profit', with the size and color of the points representing the 'quantity' of items sold. We are taking the data based on a specified date range, and then creating a scatter plot of 'sales' against 'profit'. There appears to be a positive correlation between sales and profit. As sales increases, profit also tends to increase. However, the relationship is not perfectly linear, indicating that other factors might affect profitability. Despite the overall positive trend, there are notable variations in profit at different levels of sales. For example, some data points show high profit at moderate sales levels, while others show low or even negative profit at the same sales level. This suggests that sales volume alone is not the sole determinant of profit. However, higher quantities (warmer colors) are mostly concentrated in the middle range of sales and profit, indicating that selling larger quantities does not necessarily correlate with the highest sales or profit. There are occurrences of negative profit (losses) across various sales levels. These could be due to high production or operational costs, discounts, or returns that are not compensated by the sales revenue. The top right section of the graph, where both sales and profit are high, is sparsely populated, indicating that while high profitability is achievable, it occurs less frequently.
4. MODELLING STRATEGY
K means Clustering
K-means clustering is an effective unsupervised machine learning technique for grouping related data points into a fixed number of groups, or clusters. Deciding on the right number of clusters (K) for the given dataset is one of the most important tasks in the K-means clustering process. The Elbow Method is one approach that can be used for this.
Run to view results
Visually inspect the elbow curve and identify the point where the decrease in WCSS slows down (i.e., the curve starts to bend like an elbow). This point indicates the optimal number of clusters (any value between 3-10)
Run to view results
The Profit axis spans from approximately -40 to 60. There are data points that show negative profit, which could be of particular interest because they represent loss-making sales. The concentration of points above the zero line on the Profit axis implies that the majority of transactions were profitable. The Quantity axis ranges from 0 to about 9, and there is a dense cluster of points between the lower ranges of Quantity, Sales, and Profit. This could suggest that most transactions involve small quantities.
Recommendations
Examine Low-Profit and Loss-Making Sales: Look closely at the clusters where profit is low or negative. Determine if these losses are due to pricing, high costs, or other factors. Consider discontinuing products or services that consistently result in losses or re-evaluating the pricing strategy
Optimize Product Bundling: If certain clusters represent sales with high quantity but low profitability, consider whether bundling these products with higher-margin items could improve profits. This could encourage customers to purchase more profitable items alongside bulk products.
Customer Segmentation: Use the cluster information to segment customers according to their purchasing behavior. Tailor marketing campaigns and sales strategies to each segment to improve customer satisfaction and increase sales.
Inventory Management: Analyze clusters to manage inventory more effectively. Products that result in high sales and profit might need more inventory space, whereas low-selling, low-profit products may need to have reduced stock levels.
Cost Reduction Strategies: For clusters that have a high quantity of sales but lower profit margins, look into cost reduction strategies such as negotiating better terms with suppliers, reducing production costs, or finding more efficient distribution methods.
Price Adjustment: Reassess the pricing strategy for clusters that show high sales volume but low profit, as there may be room for a price increase without significantly affecting sales volume. Conversely, for clusters with low sales volume, consider whether a price reduction could stimulate demand.
Optimizing Pricing Strategies: Clustering customers based on the discounts they receive can help businesses understand the price sensitivity of different customer segments. By analyzing clusters with high discount rates, businesses can determine whether discounts are effectively driving sales or if they are eroding profit margins. Adjustments to pricing strategies can then be made to maximize profitability while maintaining customer satisfaction.
Product Development: Use insights from the clusters to inform product development. If a cluster with high profit margins is identified, consider developing similar or complementary products to capitalize on this success.
Analyze External Factors: Consider external factors that might affect certain clusters, such as seasonality, market trends, or economic conditions. Adjust business strategies to anticipate and respond to these factors.
Run to view results