Introduction
This data analysis project uses linear regression to forecast profit from sales, quantity, and discount rates. Building a linear regression model on the Superstore dataset provides insight into the relationships between these variables: the model both projects future profits and shows how changes in sales, quantity, and discount affect profitability. With this analytical method, businesses can make well-informed decisions on pricing tactics, inventory control, and overall profitability optimisation.
Data Importing
Using Python's pandas library, the data import section of the project loads the dataset "dataset_Superstore.csv" and displays the first few rows. The code first imports pandas, then uses the pd.read_csv() function to read the CSV file into a DataFrame called "df", and finally calls df.head() to show the first few rows. This stage is essential because it allows an initial review of the dataset's structure and contents before moving on to other analysis tasks.
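A minimal sketch of this step, assuming the CSV file sits in the working directory, might look like:

```python
import pandas as pd

# Load the Superstore dataset into a DataFrame
df = pd.read_csv("dataset_Superstore.csv")

# Inspect the first few rows to review structure and contents
print(df.head())
```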
Data Cleaning
To guarantee the accuracy and consistency of the dataset, the data cleaning phase involves several key procedures. First, df.info() is used to examine a summary of the DataFrame, including column names, non-null counts, and data types, giving an overview of the dataset's features. Next, duplicate rows are checked with df.duplicated().sum(); in this instance, the dataset contains no duplicate entries. The 'Order Date' and 'Ship Date' columns are then converted to datetime format with pd.to_datetime() to guarantee consistency and enable time-based analysis, and an additional check confirms that every 'Order Date' value precedes its matching 'Ship Date' value. Finally, a targeted check identifies rows where the profit was negative but the item was not returned, by filtering on profit and return status; such rows may point to incorrect data entries. Together, these cleaning procedures produce a clean and trustworthy dataset ready for further analysis and visualisation.
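A sketch of these cleaning checks might look like the following; the 'Yes' encoding of the 'Returned' column is an assumption, as the original values are not shown:

```python
import pandas as pd

df = pd.read_csv("dataset_Superstore.csv")

# Overview of structure: column names, non-null counts, data types
df.info()

# Check for duplicate rows (none are expected here)
print("Duplicate rows:", df.duplicated().sum())

# Convert date columns to datetime for consistent, time-based analysis
df["Order Date"] = pd.to_datetime(df["Order Date"])
df["Ship Date"] = pd.to_datetime(df["Ship Date"])

# Confirm every order date precedes (or equals) its ship date
print("Dates coherent:", (df["Order Date"] <= df["Ship Date"]).all())

# Flag rows with negative profit that were not returned
# ('Yes' as the returned marker is an assumed encoding)
suspect = df[(df["Profit"] < 0) & (df["Returned"] != "Yes")]
print(suspect.head())
```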
Outlier Detection
The following statistics describe the distribution and properties of the dataset's key variables:

Profit: the average profit is about $30.74, with a standard deviation of roughly $114.27; profits range from a minimum of -$653.28 (a loss) to a maximum of $731.44.
Sales: the average sales figure is $204.94, with a standard deviation of roughly $370.86; sales values range from a minimum of $0.44 to a maximum of $2099.59.
Quantity: an average of 3.95 items are ordered per transaction, with a standard deviation of roughly 9.32 items; the quantity ranges from 1 to 526 items.
Discount: the average discount applied is about 0.16, with a standard deviation of roughly 0.21; discounts range from 0 to a maximum of 1.3.

These figures show the distribution and variability of profit, sales, quantity, and discount within the dataset. Negative profits and large discounts point to areas that deserve further investigation to optimise pricing and profitability, and the broad range of quantities ordered suggests varied buying habits that could inform more focused marketing campaigns or better inventory control.
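This section's heading points to outlier handling; the conclusion describes replacing extreme values in 'Profit', 'Sales', and 'Quantity' with boundaries set three standard deviations from the mean, and a minimal sketch of that approach could be:

```python
import pandas as pd

df = pd.read_csv("dataset_Superstore.csv")

# Cap extreme values at three standard deviations from the mean,
# as described in the conclusion (a simple, common outlier rule)
for col in ["Profit", "Sales", "Quantity"]:
    mean, std = df[col].mean(), df[col].std()
    lower, upper = mean - 3 * std, mean + 3 * std
    df[col] = df[col].clip(lower=lower, upper=upper)

print(df[["Profit", "Sales", "Quantity"]].describe())
```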
Summary Statistics
The dataset comprises 9994 records, with descriptive statistics for profit, sales, quantity, and discount:

Profit: the average profit is roughly $30.74, with a standard deviation of about $114.27; profits range from -$653.28 to $731.44.
Sales: the average sales figure is $204.94, with a standard deviation of roughly $370.86; sales values range from a minimum of $0.44 to a maximum of $2099.59.
Quantity: the average quantity ordered is 3.95 items, with a standard deviation of approximately 9.32 items; the quantity ranges from 1 to 526 items.
Discount: the average discount applied is roughly 0.16, with a standard deviation of around 0.21; discounts range from a minimum of 0 to a maximum of 1.3.

These statistics provide insight into the distribution and variability of the key metrics in the dataset, highlighting the range and spread of values for each variable.
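These figures correspond to the output of pandas' describe() on the key columns; a minimal reproduction might be:

```python
import pandas as pd

df = pd.read_csv("dataset_Superstore.csv")

# Descriptive statistics for the key numeric columns
print(df[["Profit", "Sales", "Quantity", "Discount"]].describe())
```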
Data Plotting
Using the matplotlib and seaborn libraries, the data plotting part of the project visualises key insights from the dataset. Two plots are produced. Total Profit per Segment bar plot: grouping the data by 'Segment' and summing the profits yields the total profit for each segment, shown as a bar plot with the colour set to coral and gridlines added for readability; this gives a clear comparison of the profitability of the different segments. Sales vs. Profit scatter plot: a scatter plot illustrates the relationship between 'Sales' and 'Profit', with triangle markers and adjusted transparency for clarity; it reveals patterns or trends in the data and helps in understanding how sales values relate to profitability. Together, these visualisations represent important quantities such as the profit distribution across segments and the correlation between sales and profit, supporting the identification of trends, anomalies, or areas requiring further investigation.
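A plausible sketch of the two plots, with titles and styling choices following the description above (the exact figure titles are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("dataset_Superstore.csv")

# Bar plot: total profit per segment, coral bars with gridlines
segment_profit = df.groupby("Segment")["Profit"].sum()
segment_profit.plot(kind="bar", color="coral")
plt.title("Total Profit per Segment")
plt.ylabel("Profit")
plt.grid(axis="y")
plt.show()

# Scatter plot: sales vs. profit, triangle markers with transparency
sns.scatterplot(data=df, x="Sales", y="Profit", marker="^", alpha=0.5)
plt.title("Sales vs. Profit")
plt.show()
```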
Linear Regression
This code snippet loads the dataset "dataset_Superstore.csv" into a DataFrame, performs one-hot encoding on the categorical columns, assigns the features and target variable by dropping specific columns, and uses train_test_split to divide the data into training and test sets. It also checks whether the encoded DataFrame contains the 'Profit' column: if so, the data splitting proceeds; if not, a message stating that the 'Profit' column is missing is printed. This section prepares the data for machine learning by encoding categorical variables, guarding against a missing target column, and splitting the data for model training and evaluation.
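A sketch of this preparation step; the specific categorical columns encoded ('Segment', 'Category'), the 80/20 split ratio, and the random seed are assumptions, since the original cell is not shown:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset_Superstore.csv")

# One-hot encode categorical columns (the column list is illustrative)
df_encoded = pd.get_dummies(df, columns=["Segment", "Category"], drop_first=True)

if "Profit" in df_encoded.columns:
    # Features: everything except the target; any remaining non-numeric
    # columns would still need dropping before fitting a model
    X = df_encoded.drop(columns=["Profit"])
    y = df_encoded["Profit"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42  # assumed ratio and seed
    )
else:
    print("'Profit' column is missing")
```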
To begin with, this code snippet imports the required libraries: pandas as pd for data handling and LinearRegression from sklearn.linear_model for linear regression modelling. The dataset 'dataset_Superstore.csv' is then loaded into a DataFrame. The relevant independent variables ('Sales', 'Quantity', and 'Discount') are selected and 'Profit' is designated as the dependent variable, preparing the data for linear regression. A linear regression model is created, fitted to the data, and then used to generate predictions on the same dataset. In short, this code trains a model to forecast profit from sales, quantity, and discount values, and then uses it to produce predictions.
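A minimal sketch of this regression cell might be:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("dataset_Superstore.csv")

X = df[["Sales", "Quantity", "Discount"]]  # independent variables
y = df["Profit"]                           # dependent variable

# Fit the model and predict on the same dataset it was trained on
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
print(predictions[:5])
```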
This code snippet likewise begins by importing the required libraries: pandas as pd for data handling and LinearRegression from sklearn.linear_model for linear regression modelling, and loads 'dataset_Superstore.csv' into a DataFrame. The code extracts the 'Sales' and 'Profit' columns, reshapes the 'Sales' column into a two-dimensional array so that it can serve as the single independent variable, and designates 'Profit' as the dependent variable. A linear regression model is then fitted on these two columns. To demonstrate how the trained model estimates profit from sales data, an example prediction is made for a new sales value of 10, and the projected profit is printed to the console.
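A sketch of the single-feature regression, assuming the reshape is done via a one-column DataFrame:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("dataset_Superstore.csv")

# Keep 'Sales' two-dimensional so it works as the single feature
X = df[["Sales"]]
y = df["Profit"]

model = LinearRegression()
model.fit(X, y)

# Example prediction for a new sales value of 10
new_sales = pd.DataFrame({"Sales": [10]})
predicted_profit = model.predict(new_sales)
print(predicted_profit[0])
```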
This piece of code imports the libraries required for data processing, model selection, and evaluation. It loads the 'dataset_Superstore.csv' dataset into a DataFrame, chooses 'Profit' and 'Sales' as the features (X) and 'Returned' as the target variable (y), and splits the data into training and testing sets in an 80/20 ratio. A logistic regression model is built, trained on the training set, and applied to forecast outcomes on the test set. The model's performance is measured with accuracy, and a classification report including precision, recall, F1-score, and support for each class is produced. Finally, the code prints the accuracy score and classification report to evaluate the model's predictive performance on the test data.
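A sketch of the classification cell; the default solver, random seed, and the label encoding of 'Returned' are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv("dataset_Superstore.csv")

X = df[["Profit", "Sales"]]
y = df["Returned"]  # target; its exact encoding is assumed

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```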
This code loads the dataset 'dataset_Superstore.csv' and prepares the data for linear regression by selecting 'Sales', 'Quantity', and 'Discount' as the independent variables (X) and 'Profit' as the dependent variable (y). The data is split into training and testing sets in an 80/20 ratio. A linear regression model is created, trained on the training set, and applied to forecast profits on the test set. The model's performance is then assessed by computing the mean squared error, where a lower value indicates a better fit between the model and the data. The code also interprets the fitted coefficients, highlighting how sales, quantity, and discount each affect profit.
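A sketch of the evaluation cell; the random seed is an assumption:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv("dataset_Superstore.csv")

X = df[["Sales", "Quantity", "Discount"]]
y = df["Profit"]

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Lower MSE indicates a better fit on unseen data
print("Mean squared error:", mean_squared_error(y_test, y_pred))

# Coefficients show how each variable influences profit
print("Coefficients:", dict(zip(X.columns, model.coef_)))
```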
Conclusion
Using Python's pandas library, the dataset "dataset_Superstore.csv" was successfully loaded into a DataFrame during the initial data import phase. This stage made it possible to investigate the structure and contents of the dataset at the outset, giving subsequent analysis a solid foundation. Next, data cleaning operations ensured consistency and correctness: the absence of duplicate rows was confirmed, and date columns were converted to datetime format to simplify time-based analysis. Notably, data coherence was ensured by verifying that each 'Order Date' value came before its matching 'Ship Date' value. The 'Profit', 'Sales', and 'Quantity' columns were examined for outliers, and extreme values were replaced with upper or lower boundaries set at three standard deviations from the mean; by resolving abnormalities that could distort analytical findings, this procedure improved the dataset's integrity. Finally, summary statistics for the key variables profit, sales, quantity, and discount described their distribution and characteristics within the dataset; these figures capture the variability, ranges, and average values of the key indicators and guide further study.
Recommendations
To build on these findings, it is advised to investigate further the market segments with negative profits but no returns. Examining these cases can reveal the underlying causes of unprofitability in the absence of returns and motivate strategic changes to increase profitability. Improved visualisation methods could be used to produce more plots exploring the correlations between variables; visualising the relationships between multiple metrics, such as profit and sales across different sectors or geographies, can yield deeper insight into the factors driving profitability. Further, sophisticated analytics methods such as clustering or predictive modelling may provide a more detailed understanding of consumer behaviour patterns or pricing schemes. By applying machine learning algorithms to sales patterns or client segmentation, businesses can optimise pricing strategies and operational efficiency through data-driven decision-making. Overall, a thorough methodology that combines in-depth analysis with advanced analytics techniques can extract valuable insights from the dataset and direct strategic decision-making towards enhanced corporate performance.