Introduction
This data analysis project uses linear regression to forecast profit from sales, quantity, and discount rates. Building a linear regression model on the Superstore dataset provides insight into the relationships between these variables: the model both projects future profits and shows how changes in sales, quantity, and discount affect profitability. With this analytical method, businesses can make well-informed decisions on pricing tactics, inventory control, and overall profitability optimisation.
Data Importing
Using Python's pandas library, the data import section of the project loads the dataset "dataset_Superstore.csv" and displays the first few rows. The code first imports pandas, then uses the pd.read_csv() function to read the CSV file into a DataFrame called "df", and finally calls df.head() to show the first few rows. This stage is essential because it allows an initial review of the dataset's structure and contents before moving on to other analysis tasks.
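A minimal sketch of this step, assuming the CSV file sits in the working directory, might look like:

```python
import pandas as pd

# Load the Superstore dataset into a DataFrame
df = pd.read_csv("dataset_Superstore.csv")

# Inspect the first few rows to review structure and contents
print(df.head())
```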
Data Cleaning
To guarantee the accuracy and consistency of the dataset, the data cleaning phase involves several key procedures. First, df.info() is used to examine a summary of the DataFrame, including column names, non-null counts, and data types, giving an overview of the dataset's features. Next, duplicate rows are checked with df.duplicated().sum(); in this instance, the dataset contains no duplicate entries. The 'Order Date' and 'Ship Date' columns are then converted to datetime format with pd.to_datetime() to guarantee consistency and enable time-based analysis, and an additional check confirms that every 'Order Date' value precedes its matching 'Ship Date' value. Finally, a targeted check identifies rows where the profit was negative but the item was not returned, by filtering on profit and return status; such rows may point to incorrect data entries. Together, these cleaning procedures produce a clean and trustworthy dataset ready for further analysis and visualisation.
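A sketch of these cleaning checks might look like the following; the 'Yes' encoding of the 'Returned' column is an assumption, as the original values are not shown:

```python
import pandas as pd

df = pd.read_csv("dataset_Superstore.csv")

# Overview of structure: column names, non-null counts, data types
df.info()

# Check for duplicate rows (none are expected here)
print("Duplicate rows:", df.duplicated().sum())

# Convert date columns to datetime for consistent, time-based analysis
df["Order Date"] = pd.to_datetime(df["Order Date"])
df["Ship Date"] = pd.to_datetime(df["Ship Date"])

# Confirm every order date precedes (or equals) its ship date
print("Dates coherent:", (df["Order Date"] <= df["Ship Date"]).all())

# Flag rows with negative profit that were not returned
# ('Yes' as the returned marker is an assumed encoding)
suspect = df[(df["Profit"] < 0) & (df["Returned"] != "Yes")]
print(suspect.head())
```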
Outlier Detection
The following statistics describe the distribution and properties of the dataset's key variables:

Profit: the average profit is about $30.74, with a standard deviation of roughly $114.27; profits range from a minimum of -$653.28 (a loss) to a maximum of $731.44.
Sales: the average sales figure is $204.94, with a standard deviation of roughly $370.86; sales values range from a minimum of $0.44 to a maximum of $2099.59.
Quantity: an average of 3.95 items are ordered per transaction, with a standard deviation of roughly 9.32 items; the quantity ranges from 1 to 526 items.
Discount: the average discount applied is about 0.16, with a standard deviation of roughly 0.21; discounts range from 0 to a maximum of 1.3.

These figures show the distribution and variability of profit, sales, quantity, and discount within the dataset. Negative profits and large discounts point to areas that deserve further investigation to optimise pricing and profitability, and the broad range of quantities ordered suggests varied buying habits that could inform more focused marketing campaigns or better inventory control.
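This section's heading points to outlier handling; the conclusion describes replacing extreme values in 'Profit', 'Sales', and 'Quantity' with boundaries set three standard deviations from the mean, and a minimal sketch of that approach could be:

```python
import pandas as pd

df = pd.read_csv("dataset_Superstore.csv")

# Cap extreme values at three standard deviations from the mean,
# as described in the conclusion (a simple, common outlier rule)
for col in ["Profit", "Sales", "Quantity"]:
    mean, std = df[col].mean(), df[col].std()
    lower, upper = mean - 3 * std, mean + 3 * std
    df[col] = df[col].clip(lower=lower, upper=upper)

print(df[["Profit", "Sales", "Quantity"]].describe())
```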
Summary Statistics
The dataset comprises 9994 records, with descriptive statistics for profit, sales, quantity, and discount:

Profit: the average profit is roughly $30.74, with a standard deviation of about $114.27; profits range from -$653.28 to $731.44.
Sales: the average sales figure is $204.94, with a standard deviation of roughly $370.86; sales values range from a minimum of $0.44 to a maximum of $2099.59.
Quantity: the average quantity ordered is 3.95 items, with a standard deviation of approximately 9.32 items; the quantity ranges from 1 to 526 items.
Discount: the average discount applied is roughly 0.16, with a standard deviation of around 0.21; discounts range from a minimum of 0 to a maximum of 1.3.

These statistics provide insight into the distribution and variability of the key metrics in the dataset, highlighting the range and spread of values for each variable.
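These figures correspond to the output of pandas' describe() on the key columns; a minimal reproduction might be:

```python
import pandas as pd

df = pd.read_csv("dataset_Superstore.csv")

# Descriptive statistics for the key numeric columns
print(df[["Profit", "Sales", "Quantity", "Discount"]].describe())
```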
Data Plotting
Using the matplotlib and seaborn libraries, the data plotting part of the project visualises key insights from the dataset. Two plots are produced. Total Profit per Segment bar plot: grouping the data by 'Segment' and summing the profits yields the total profit for each segment, shown as a bar plot with the colour set to coral and gridlines added for readability; this gives a clear comparison of the profitability of the different segments. Sales vs. Profit scatter plot: a scatter plot illustrates the relationship between 'Sales' and 'Profit', with triangle markers and adjusted transparency for clarity; it reveals patterns or trends in the data and helps in understanding how sales values relate to profitability. Together, these visualisations represent important quantities such as the profit distribution across segments and the correlation between sales and profit, supporting the identification of trends, anomalies, or areas requiring further investigation.
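A plausible sketch of the two plots, with titles and styling choices following the description above (the exact figure titles are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("dataset_Superstore.csv")

# Bar plot: total profit per segment, coral bars with gridlines
segment_profit = df.groupby("Segment")["Profit"].sum()
segment_profit.plot(kind="bar", color="coral")
plt.title("Total Profit per Segment")
plt.ylabel("Profit")
plt.grid(axis="y")
plt.show()

# Scatter plot: sales vs. profit, triangle markers with transparency
sns.scatterplot(data=df, x="Sales", y="Profit", marker="^", alpha=0.5)
plt.title("Sales vs. Profit")
plt.show()
```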
Linear Regression
This code snippet loads the dataset "dataset_Superstore.csv" into a DataFrame, performs one-hot encoding on the categorical columns, assigns the features and target variable by dropping specific columns, and uses train_test_split to divide the data into training and test sets. It also checks whether the encoded DataFrame contains the 'Profit' column: if so, the data splitting proceeds; if not, a message stating that the 'Profit' column is missing is printed. This section prepares the data for machine learning by encoding categorical variables, guarding against a missing target column, and splitting the data for model training and evaluation.
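A sketch of this preparation step; the specific categorical columns encoded ('Segment', 'Category'), the 80/20 split ratio, and the random seed are assumptions, since the original cell is not shown:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset_Superstore.csv")

# One-hot encode categorical columns (the column list is illustrative)
df_encoded = pd.get_dummies(df, columns=["Segment", "Category"], drop_first=True)

if "Profit" in df_encoded.columns:
    # Features: everything except the target; any remaining non-numeric
    # columns would still need dropping before fitting a model
    X = df_encoded.drop(columns=["Profit"])
    y = df_encoded["Profit"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42  # assumed ratio and seed
    )
else:
    print("'Profit' column is missing")
```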
To begin with, this code snippet imports the required libraries: pandas as pd for data handling and LinearRegression from sklearn.linear_model for linear regression modelling. The dataset 'dataset_Superstore.csv' is then loaded into a DataFrame. The relevant independent variables ('Sales', 'Quantity', and 'Discount') are selected and 'Profit' is designated as the dependent variable, preparing the data for linear regression. A linear regression model is created, fitted to the data, and then used to generate predictions on the same dataset. In short, this code trains a model to forecast profit from sales, quantity, and discount values, and then uses it to produce predictions.
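A minimal sketch of this regression cell might be:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("dataset_Superstore.csv")

X = df[["Sales", "Quantity", "Discount"]]  # independent variables
y = df["Profit"]                           # dependent variable

# Fit the model and predict on the same dataset it was trained on
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
print(predictions[:5])
```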
This code snippet likewise begins by importing the required libraries: pandas as pd for data handling and LinearRegression from sklearn.linear_model for linear regression modelling, and loads 'dataset_Superstore.csv' into a DataFrame. The code extracts the 'Sales' and 'Profit' columns, reshapes the 'Sales' column into a two-dimensional array so that it can serve as the single independent variable, and designates 'Profit' as the dependent variable. A linear regression model is then fitted on these two columns. To demonstrate how the trained model estimates profit from sales data, an example prediction is made for a new sales value of 10, and the projected profit is printed to the console.
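A sketch of the single-feature regression, assuming the reshape is done via a one-column DataFrame:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("dataset_Superstore.csv")

# Keep 'Sales' two-dimensional so it works as the single feature
X = df[["Sales"]]
y = df["Profit"]

model = LinearRegression()
model.fit(X, y)

# Example prediction for a new sales value of 10
new_sales = pd.DataFrame({"Sales": [10]})
predicted_profit = model.predict(new_sales)
print(predicted_profit[0])
```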
This piece of code imports the libraries required for data processing, model selection, and evaluation. It loads the 'dataset_Superstore.csv' dataset into a DataFrame, chooses 'Profit' and 'Sales' as the features (X) and 'Returned' as the target variable (y), and splits the data into training and testing sets in an 80/20 ratio. A logistic regression model is built, trained on the training set, and applied to forecast outcomes on the test set. The model's performance is measured with accuracy, and a classification report including precision, recall, F1-score, and support for each class is produced. Finally, the code prints the accuracy score and classification report to evaluate the model's predictive performance on the test data.
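A sketch of the classification cell; the default solver, random seed, and the label encoding of 'Returned' are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv("dataset_Superstore.csv")

X = df[["Profit", "Sales"]]
y = df["Returned"]  # target; its exact encoding is assumed

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```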
This code loads the dataset 'dataset_Superstore.csv' and prepares the data for linear regression by selecting 'Sales', 'Quantity', and 'Discount' as the independent variables (X) and 'Profit' as the dependent variable (y). The data is split into training and testing sets in an 80/20 ratio. A linear regression model is created, trained on the training set, and applied to forecast profits on the test set. The model's performance is then assessed by computing the mean squared error, where a lower value indicates a better fit between the model and the data. The code also interprets the fitted coefficients, highlighting how sales, quantity, and discount each affect profit.
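A sketch of the evaluation cell; the random seed is an assumption:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv("dataset_Superstore.csv")

X = df[["Sales", "Quantity", "Discount"]]
y = df["Profit"]

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Lower MSE indicates a better fit on unseen data
print("Mean squared error:", mean_squared_error(y_test, y_pred))

# Coefficients show how each variable influences profit
print("Coefficients:", dict(zip(X.columns, model.coef_)))
```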
Conclusion
Using Python's pandas library, the dataset "dataset_Superstore.csv" was successfully loaded into a DataFrame during the initial data import phase. This stage made it possible to investigate the structure and contents of the dataset at the outset, giving subsequent analysis a solid foundation. Next, data cleaning operations ensured consistency and correctness: the absence of duplicate rows was confirmed, and date columns were converted to datetime format to simplify time-based analysis. Notably, data coherence was ensured by verifying that each 'Order Date' value came before its matching 'Ship Date' value. The 'Profit', 'Sales', and 'Quantity' columns were examined for outliers, and extreme values were replaced with upper or lower boundaries set at three standard deviations from the mean; by resolving abnormalities that could distort analytical findings, this procedure improved the dataset's integrity. Finally, summary statistics for the key variables profit, sales, quantity, and discount described their distribution and characteristics within the dataset; these figures capture the variability, ranges, and average values of the key indicators and guide further study.
Recommendations
To build on these findings, it is advised to investigate further the market segments with negative profits but no returns. Examining these cases can reveal the underlying causes of unprofitability in the absence of returns and motivate strategic changes to increase profitability. Improved visualisation methods could be used to produce more plots exploring the correlations between variables; visualising the relationships between multiple metrics, such as profit and sales across different sectors or geographies, can yield deeper insight into the factors driving profitability. Further, sophisticated analytics methods such as clustering or predictive modelling may provide a more detailed understanding of consumer behaviour patterns or pricing schemes. By applying machine learning algorithms to sales patterns or client segmentation, businesses can optimise pricing strategies and operational efficiency through data-driven decision-making. Overall, a thorough methodology that combines in-depth analysis with advanced analytics techniques can extract valuable insights from the dataset and direct strategic decision-making towards enhanced corporate performance.