Reading In the Dataset

import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn import linear_model df = pd.read_csv('winequality-red.csv', delimiter=';') df

df.corr()

Initial Setup

# df.keys() fixedacid = df['fixed acidity'].values volatile = df['volatile acidity'].values citric = df['citric acid'].values sugar = df['residual sugar'].values chlorides = df['chlorides'].values freeso2 = df['free sulfur dioxide'].values totalso2 = df['total sulfur dioxide'].values density = df['density'].values ph = df['pH'].values sulphates = df['sulphates'].values aclohol = df['alcohol'].values quality = df['quality'].values

X = np.array([fixedacid, volatile, citric, sugar, chlorides, freeso2, totalso2, density, ph, sulphates, aclohol]).T #Note alcohol = X[:,10] y = quality plt.plot(X[:,10], y, '.') plt.xlabel('Alcohol') plt.ylabel('Quality') plt.title('Perliminary Correlation Plot')

Test Train Split

(X_train, X_test, y_train, y_test) = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

Building a Model

#Finding the best value for alpha alpha = np.logspace(-16,-1,10000) scores = [] for a in alpha: model = linear_model.Lasso(alpha=a, max_iter=500000) model.fit(X_train, y_train) score = model.score(X_train, y_train) scores.append(score) plt.plot(a, score, '.') print(f'max score: {np.max(scores)} at index {np.argmax(scores)} where alpha = {alpha[np.argmax(scores)]}')

Model Analysis

maxalpha = alpha[np.argmax(scores)] model = linear_model.Lasso(alpha=maxalpha, max_iter=50000) model.fit(X_train, y_train) model.score(X_train, y_train)

plt.plot(X[:,10], y, 'ko', X_train[:,10], model.predict(X_train), 'r.', X_test[:,10], model.predict(X_test), 'b.') plt.xlabel('Alcohol') plt.ylabel('Quality') plt.show()

Summary

The model works by inputting the chemical components of a red wine (such as pH, citric acid content, alcohol content, etc.) and outputting a prediction for the quality of said wine on a scale from 0-10 with an error of approximately 1. In the visuals above I tried to focus on alcohol content vs wine quality because a preliminary assessment showed strongest correlation with respect to quality.