Summary
This project explores how machine learning can predict wine quality from physicochemical properties. Using the red wine subset of the UCI Wine Quality dataset, I trained and evaluated two models, a Decision Tree and a Random Forest, to determine which predicts quality ratings more accurately. I also identified the features that most influence the predicted quality.
Workflow
Data Exploration and Preprocessing – The project begins with loading and exploring the dataset. Because the data was already clean, minimal preprocessing was needed before modeling.
Model Training: Decision Tree vs. Random Forest – I hypothesized that a Random Forest, by aggregating the predictions of multiple decision trees, would generalize better and achieve higher accuracy than a single Decision Tree. I trained both classifiers on the same training split and compared their accuracy on the held-out test set.
Model Evaluation – The Random Forest outperformed the Decision Tree (72.5% vs. 63.7% test accuracy), which is consistent with the initial hypothesis.
Feature Importance Analysis – I then extracted feature importances from the Random Forest model to understand which physicochemical variables had the strongest impact on predicted wine quality. A horizontal bar chart and ranked table were used for visual interpretation.
Conclusion
This project demonstrates my ability to build, compare, and interpret machine learning models on a real-world dataset and to visualize what drives model predictions. The Random Forest achieved higher test accuracy than the Decision Tree (72.5% vs. 63.7%), supporting the hypothesis that ensemble models can generalize better. The most influential features were alcohol content, total sulfur dioxide, sulphates, and volatile acidity.
Python code for Predicting Wine Quality with Machine Learning
# Download the red wine quality dataset from Kaggle (notebook shell commands)
!pip install kaggle
!kaggle datasets download -d uciml/red-wine-quality-cortez-et-al-2009
import zipfile
# Extract the downloaded archive into the wine_data folder
with zipfile.ZipFile("red-wine-quality-cortez-et-al-2009.zip", 'r') as zip_ref:
    zip_ref.extractall("wine_data")
Dataset URL: https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009 License(s): DbCL-1.0
import pandas as pd
# Load the CSV
df = pd.read_csv("wine_data/winequality-red.csv", sep=',')
df.head()
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
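The workflow describes the data as already clean. A minimal sketch of the kind of check that could back this up is shown here, using only the five rows displayed above, retyped by hand. The names `sample`, `missing_total`, `duplicate_rows`, and `class_counts` are introduced for illustration; on the full data the same calls would be made on `df`.

```python
import pandas as pd

# The five rows shown by df.head() above, retyped for illustration only.
columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]
rows = [
    [7.4, 0.70, 0.00, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4, 5],
    [7.8, 0.88, 0.00, 2.6, 0.098, 25.0, 67.0, 0.9968, 3.20, 0.68, 9.8, 5],
    [7.8, 0.76, 0.04, 2.3, 0.092, 15.0, 54.0, 0.9970, 3.26, 0.65, 9.8, 5],
    [11.2, 0.28, 0.56, 1.9, 0.075, 17.0, 60.0, 0.9980, 3.16, 0.58, 9.8, 6],
    [7.4, 0.70, 0.00, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4, 5],
]
sample = pd.DataFrame(rows, columns=columns)

# Total number of missing cells across all columns.
missing_total = int(sample.isna().sum().sum())
# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())
# How many wines fall into each quality rating.
class_counts = sample["quality"].value_counts().sort_index()

print("Missing values:", missing_total)
print("Duplicate rows:", duplicate_rows)
print(class_counts)
```

Rows 0 and 4 of the preview are identical, so the duplicate count is non-zero even in this small sample. Whether to keep or drop repeated measurements is a judgment call that the original analysis does not make.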
Import libraries, modules and sub-modules
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
print("Done importing")
Done importing
#Preparing data for supervised learning model
X = df.drop("quality", axis=1)
y = df["quality"]
#Split the data. 20% test data, 80% train data, random state allows reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
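The split above fixes `random_state` for reproducibility. Since quality ratings are usually unevenly distributed, one optional variation is to also pass `stratify`, which keeps class proportions the same in the train and test sets. The sketch below uses made-up labels (`y_demo`) and a placeholder feature (`X_demo`), not the wine data, and is not what the original split does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 fives, 30 sixes, 10 sevens.
y_demo = np.array([5] * 60 + [6] * 30 + [7] * 10)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo asks for the same class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=40, stratify=y_demo
)
labels, counts = np.unique(y_te, return_counts=True)
test_counts = {int(k): int(v) for k, v in zip(labels, counts)}
print(test_counts)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=40, stratify=y_demo
)
same_split = bool(np.array_equal(X_te, X_te2))
print("Same split on rerun:", same_split)
```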
A Decision Tree is a supervised machine learning algorithm that predicts outcomes by recursively splitting data based on feature values. In contrast, a Random Forest is an ensemble method that combines predictions from multiple decision trees, making the overall model more robust and less prone to overfitting. Because a Random Forest aggregates results from diverse trees, it tends to generalize better and produce more reliable predictions than a single tree.
Hypothesis: The prediction accuracy of a Random Forest model will be higher than that of a single Decision Tree on the same dataset, due to its ability to reduce overfitting through ensemble learning.
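To make the idea of "combining predictions from multiple decision trees" concrete, the sketch below rebuilds a forest's class probabilities by hand. It relies on scikit-learn's documented behavior that `RandomForestClassifier` averages the probability estimates of its individual trees. The data is synthetic (`make_classification`), and `X_demo`, `y_demo`, and `demo_forest` are illustrative names, not part of the wine analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data, used only to illustrate the mechanism.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
demo_forest = RandomForestClassifier(n_estimators=25, random_state=0)
demo_forest.fit(X_demo, y_demo)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([t.predict_proba(X_demo) for t in demo_forest.estimators_])
# Average over the trees, then pick the most probable class per sample.
mean_proba = tree_probas.mean(axis=0)
manual_preds = demo_forest.classes_[mean_proba.argmax(axis=1)]

probas_match = bool(np.allclose(mean_proba, demo_forest.predict_proba(X_demo)))
agreement = float(np.mean(manual_preds == demo_forest.predict(X_demo)))
print("Averaged tree probabilities match the forest:", probas_match)
print("Share of identical predictions:", agreement)
```

Each tree is trained on a different bootstrap sample and considers a random subset of features at each split, so the individual trees' errors are partly independent and tend to cancel out in the average.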
#Train a Single Decision Tree
tree = DecisionTreeClassifier(random_state=40)
tree.fit(X_train, y_train)
#Making predictions with a tree
tree_preds = tree.predict(X_test)
print(tree_preds)
[5 6 6 7 6 5 6 6 5 5 7 6 7 7 5 7 4 6 5 5 5 6 5 5 6 6 6 5 5 5 5 3 5 5 5 6 6 5 4 5 5 6 6 6 5 6 5 6 6 6 6 5 5 5 6 5 6 5 6 6 7 6 6 6 4 5 5 5 7 7 6 5 4 7 7 5 5 6 5 7 6 7 5 6 6 5 6 5 7 5 6 7 6 5 6 6 5 6 5 6 5 6 5 6 7 5 7 6 6 6 7 5 5 6 6 5 5 6 6 5 5 5 5 5 5 6 4 5 6 5 5 5 7 5 5 5 5 5 5 5 6 4 6 5 4 6 5 6 7 7 6 6 6 6 5 5 5 7 6 6 6 5 7 7 6 5 5 6 6 6 6 7 5 5 5 5 6 5 5 5 7 5 4 7 5 5 6 6 6 7 7 5 5 5 5 5 5 5 6 7 5 5 5 6 6 6 6 5 6 5 7 5 6 6 5 6 6 6 6 6 6 7 4 5 6 5 5 6 5 6 4 6 7 6 6 6 5 5 6 6 7 8 7 6 6 7 7 6 5 5 6 5 5 5 5 6 6 6 5 7 5 5 3 6 5 6 6 6 6 5 4 5 6 7 6 7 6 4 5 4 6 6 5 5 5 6 6 7 7 5 6 6 7 6 5 5 5 5 5 5 6 6 6 5 5 5 5 7 6 5 6 7 6 5 5 6 6 5 7 5]
#Train a RandomForest
forest = RandomForestClassifier(random_state=40)
forest.fit(X_train, y_train)
#Making predictions with the forest
forest_preds = forest.predict(X_test)
print(forest_preds)
[6 5 6 6 6 5 6 6 5 6 6 5 6 7 5 6 6 6 5 5 5 6 5 5 7 5 6 5 5 5 6 6 6 5 5 6 6 5 5 6 5 6 5 6 6 5 6 6 6 5 6 5 5 5 6 5 6 5 6 5 6 6 6 6 5 5 5 5 6 7 5 5 6 7 6 5 5 6 5 6 6 6 5 5 6 5 5 5 6 5 6 7 6 5 6 5 5 7 5 6 5 6 5 6 7 5 6 6 5 6 7 5 6 6 6 5 5 6 6 5 5 5 5 5 5 6 5 5 6 5 6 5 7 5 5 5 5 6 5 5 6 5 6 5 5 6 5 6 7 7 6 6 6 6 6 5 6 7 6 6 6 5 7 7 6 5 5 5 6 6 5 7 5 5 5 5 6 5 5 5 7 5 5 6 5 5 6 6 6 7 7 5 5 5 6 5 5 5 6 7 5 5 7 6 6 6 6 5 6 6 7 5 5 6 5 6 6 5 6 6 7 7 5 6 6 5 5 6 5 5 5 5 6 6 6 6 5 5 6 5 6 7 6 6 6 7 7 6 7 5 6 5 5 5 5 6 6 5 6 6 5 5 6 6 5 5 6 6 6 5 5 5 5 6 6 7 6 5 5 5 5 5 5 5 5 5 5 6 7 6 6 5 5 6 5 5 5 5 5 5 6 6 6 6 5 5 5 7 6 6 6 6 6 5 6 6 6 5 6 5]
tree_acc = accuracy_score(y_test, tree_preds)
forest_acc = accuracy_score(y_test, forest_preds)
print(f"Decision Tree Accuracy: {tree_acc:.3f}")
print(f"Random Forest Accuracy: {forest_acc:.3f}")
Decision Tree Accuracy: 0.637
Random Forest Accuracy: 0.725
The Random Forest achieved an accuracy of 72.5% on the test set, while the Decision Tree achieved 63.7% on the same data. This supports the hypothesis that a Random Forest, by combining the predictions of multiple decision trees, generalizes better than a single decision tree. Because the comparison rests on one train/test split, the size of the gap may vary with a different split.
Conclusion: On this dataset, combining predictions from multiple decision trees in a Random Forest improved accuracy over a single Decision Tree, consistent with the ensemble reducing overfitting.
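A comparison based on one split can be checked with cross-validation, which repeats the train/test procedure over several folds. The sketch below is one possible way to do that, shown on synthetic data so that it runs on its own. `X_demo`, `y_demo`, `models`, and `cv_scores` are illustrative names, and the printed scores say nothing about the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 11 features, the same number as the wine data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=5, n_classes=3, random_state=40
)

# Five stratified folds, shuffled with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=40)
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=40),
    "Random Forest": RandomForestClassifier(random_state=40),
}

# One accuracy score per fold for each model.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, model in models.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

On the wine data, the same pattern would be applied to `X` and `y` in place of the synthetic arrays. A gap between the two models that is clearly larger than the fold-to-fold standard deviation would strengthen the conclusion.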
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 10))
# Use the classes the tree was actually trained on, and a plain list of feature names
plot_tree(tree, feature_names=list(X.columns), class_names=[str(c) for c in tree.classes_], filled=True)
plt.title("Decision Tree Visualization")
plt.show()
# Get the first tree from the forest
one_tree = forest.estimators_[0]
plt.figure(figsize=(20, 10))
# Class labels come from the forest, since its individual trees store encoded class indices
plot_tree(one_tree, feature_names=list(X.columns), class_names=[str(c) for c in forest.classes_], filled=True)
plt.title("One Tree from the Random Forest")
plt.show()
#Extract feature importances
importances = forest.feature_importances_
feature_names = X.columns
#Create a DataFrame of feature importance
importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)
#Plot the feature importances
plt.figure(figsize=(10, 6))
plt.barh(importance_df["Feature"], importance_df["Importance"])
plt.xlabel("Feature Importance")
plt.title("Most Influential Features for Wine Quality (Random Forest)")
plt.gca().invert_yaxis()
plt.grid(True)
plt.tight_layout()
plt.show()
#Show ranked features
print(importance_df)
| | Feature | Importance |
|---|---|---|
| 10 | alcohol | 0.142590 |
| 6 | total sulfur dioxide | 0.112105 |
| 9 | sulphates | 0.107973 |
| 1 | volatile acidity | 0.101464 |
| 7 | density | 0.093910 |
| 4 | chlorides | 0.081667 |
| 0 | fixed acidity | 0.077245 |
| 2 | citric acid | 0.076160 |
| 8 | pH | 0.071010 |
| 3 | residual sugar | 0.069333 |
| 5 | free sulfur dioxide | 0.066543 |
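These scores are scikit-learn's impurity-based importances: a higher score means the feature accounted for more of the impurity reduction across the forest's splits, and the scores are normalized to sum to 1. In this case, alcohol is the most influential feature for wine quality, while free sulfur dioxide is the least influential.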
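Impurity-based importances are computed from the training data, so a common cross-check is permutation importance on held-out data: shuffle one feature at a time and measure how much the score drops. The sketch below shows that cross-check with scikit-learn's `permutation_importance` on synthetic data. It is an optional addition, not part of the original analysis, and names such as `demo_forest` and `comparison` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 3 informative features out of 6 (illustration only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=40
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=40
)

demo_forest = RandomForestClassifier(random_state=40).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the accuracy drop.
result = permutation_importance(
    demo_forest, X_te, y_te, n_repeats=10, random_state=40
)

# Put both importance measures side by side, ranked by the permutation score.
comparison = pd.DataFrame({
    "Feature": feature_names,
    "Impurity importance": demo_forest.feature_importances_,
    "Permutation importance": result.importances_mean,
}).sort_values(by="Permutation importance", ascending=False)
print(comparison)
```

If both measures rank the same features at the top, the ranking is more trustworthy. Large disagreements would suggest that the impurity-based scores are being inflated by correlated or high-variance features.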
