Reading In the Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import linear_model
df = pd.read_csv('winequality-red.csv', delimiter=';')
df
df.corr()
Initial Setup
# df.keys()
fixedacid = df['fixed acidity'].values
volatile = df['volatile acidity'].values
citric = df['citric acid'].values
sugar = df['residual sugar'].values
chlorides = df['chlorides'].values
freeso2 = df['free sulfur dioxide'].values
totalso2 = df['total sulfur dioxide'].values
density = df['density'].values
ph = df['pH'].values
sulphates = df['sulphates'].values
alcohol = df['alcohol'].values
quality = df['quality'].values
X = np.array([fixedacid, volatile, citric, sugar, chlorides, freeso2, totalso2, density, ph, sulphates, alcohol]).T
#Note alcohol = X[:,10]
y = quality
plt.plot(X[:,10], y, '.')
plt.xlabel('Alcohol')
plt.ylabel('Quality')
plt.title('Preliminary Correlation Plot')
Train/Test Split
(X_train, X_test, y_train, y_test) = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)
Building a Model
#Finding the best value for alpha
alpha = np.logspace(-16,-1,10000)
scores = []
for a in alpha:
    model = linear_model.Lasso(alpha=a, max_iter=500000)
    model.fit(X_train, y_train)
    score = model.score(X_train, y_train)
    scores.append(score)
    plt.plot(a, score, '.')
print(f'max score: {np.max(scores)} at index {np.argmax(scores)} where alpha = {alpha[np.argmax(scores)]}')
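Because the loop above scores each candidate on the training data, the highest score will tend to land on the smallest alpha tried, since weaker regularization generally fits the training set at least as well. A minimal sketch of one alternative, choosing alpha by cross-validation with scikit-learn's LassoCV, is shown here. The synthetic data, the coarser grid, and the helper name pick_alpha_cv are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LassoCV


def pick_alpha_cv(X_train, y_train, alphas, folds=5):
    """Hypothetical helper: choose alpha by k-fold cross-validation."""
    search = LassoCV(alphas=alphas, cv=folds, max_iter=50000)
    search.fit(X_train, y_train)
    return search.alpha_, search


# Synthetic stand-in for the wine data (assumption): 200 samples, 11 features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 11))
true_coef = np.array([0.5, -1.0, 0, 0, 0, 0, 0, 0, 0, 0.3, 0.8])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# A coarser grid than the notebook's, to keep the sketch fast.
alpha_grid = np.logspace(-4, 0, 30)
best_alpha, search = pick_alpha_cv(X_demo, y_demo, alpha_grid)
print(f"alpha chosen by cross-validation: {best_alpha}")
```

With the notebook's variables, the call would be pick_alpha_cv(X_train, y_train, alpha), which leaves X_test and y_test untouched for the final evaluation.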
Model Analysis
maxalpha = alpha[np.argmax(scores)]
model = linear_model.Lasso(alpha=maxalpha, max_iter=50000)
model.fit(X_train, y_train)
model.score(X_train, y_train)
plt.plot(X[:,10], y, 'ko',
         X_train[:,10], model.predict(X_train), 'r.',
         X_test[:,10], model.predict(X_test), 'b.')
plt.xlabel('Alcohol')
plt.ylabel('Quality')
plt.show()
Summary
The model takes the chemical properties of a red wine (such as pH, citric acid content, and alcohol content) as input and outputs a predicted quality for that wine on a scale from 0 to 10, with an error of approximately 1. In the plots above I focused on alcohol content versus wine quality because a preliminary assessment showed that alcohol had the strongest correlation with quality.
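One way to put a number on the "error of approximately 1" is to compute the mean absolute error and root-mean-square error of the predictions on the held-out test set. The sketch below shows the calculation with NumPy only. The helper name error_summary and the small demo arrays are made up for illustration and are not results from the wine model.

```python
import numpy as np


def error_summary(y_true, y_pred):
    """Hypothetical helper: mean absolute error and root-mean-square error."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mae = float(np.mean(np.abs(residuals)))
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    return mae, rmse


# Made-up quality scores and predictions, for illustration only.
y_true_demo = [5, 6, 5, 7, 4]
y_pred_demo = [5.5, 5.5, 6.0, 6.0, 5.0]
mae, rmse = error_summary(y_true_demo, y_pred_demo)
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

With the notebook's variables, the call would be error_summary(y_test, model.predict(X_test)).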