# visualisation
import matplotlib.pyplot as plt
# data analysis
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
Step one is to retrieve the data and have a look at its structure. The sklearn library ships with a number of datasets that can be loaded directly. Here we load the Wine dataset and read its description to learn more about what it contains:
wine = datasets.load_wine()
print(wine['DESCR'][:500])
print(wine.data.shape)
print(wine.target.shape)
print(wine.target_names)
So far we have learned that the dataset contains 178 samples (rows) and 13 features (columns) that detail various components contained within the wine. There are three classes named 'class_0', 'class_1', and 'class_2'. The next step is to build a pandas DataFrame to hold the data. The dataframe will help when manipulating the features.
features = pd.DataFrame(data=wine['data'], columns=wine['feature_names'])
data = features.copy()  # copy so that adding the label columns below does not also modify `features`
data['target'] = wine['target']
data['class'] = data['target'].map(lambda ind: wine['target_names'][ind])
data.head()
The lambda function is used within the map function to replace each value in the 'target' series with the corresponding target name from the wine dataset. The head() function returns the first $n$ rows, where $n$ defaults to 5. This is useful when checking that the dataframe is holding the data of interest, in a format we understand.
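The same substitution could also be written with an explicit dictionary instead of a lambda; a small equivalent sketch (the keys and values below simply restate the target names printed earlier):
# equivalent to the lambda above: map the integer targets to their names via a dict
name_map = {0: 'class_0', 1: 'class_1', 2: 'class_2'}
data['class'] = data['target'].map(name_map)
data.head(3)  # head(n) returns the first n rows; n defaults to 5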
data.describe()
The describe() function in pandas generates descriptive statistics. From the output above, the highest alcohol percentage is 14.83%, the lowest is 11.03%, and the mean is 13%. From here we can look at the distribution of a feature; let's look at the distribution of total phenol content. The phenolic content of wine includes a large group of several hundred chemical compounds that affect the taste, color and mouthfeel of the wine. Using the seaborn distplot function:
sns.distplot(data['total_phenols'],kde=0)
The above figure plots the distribution of total phenols among all of the wines. We can also break this down by class:
for i in data.target.unique():
    sns.distplot(data['total_phenols'][data.target == i],
                 kde=1, label='{}'.format(i))
plt.legend()
The distribution of total phenols content by class. The three classes appear to naturally separate as low/mid/high total phenols distributions. Next we can look at the distributions of the classes for the rest of the features. For this we can use Seaborn's .kdeplot() function to cleanly distinguish each class. However, note that this scales the y-axis so that the integral under each curve is 1.
import matplotlib.gridspec as gridspec

for feature in wine['feature_names']:
    print(feature)
    gs1 = gridspec.GridSpec(3, 1)
    ax1 = plt.subplot(gs1[:-1])
    ax2 = plt.subplot(gs1[-1])
    gs1.update(right=0.60)
    sns.boxplot(x=feature, y='class', data=data, ax=ax2)
    sns.kdeplot(data[feature][data.target == 0], ax=ax1, label='0')
    sns.kdeplot(data[feature][data.target == 1], ax=ax1, label='1')
    sns.kdeplot(data[feature][data.target == 2], ax=ax1, label='2')
    ax2.yaxis.label.set_visible(False)
    ax1.xaxis.set_visible(False)
    plt.show()
Kernel density estimation (KDE) is used here to visualize the "shape" of each feature. It produces a smooth curve that estimates the distribution of the feature, and is closely related to a histogram in that both represent a distribution. KDE takes a parameter called the "bandwidth" that controls how smooth the curve is: the higher the bandwidth, the smoother the curve. The "kernel function" determines how the distance of each data point is weighted. From the box plots of the Flavanoids, Total Phenols and Alcohol features, it is clear that the class distributions have noticeably different means. From this we can expect that simple models may be sufficient to distinguish between wines.
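To see the effect of bandwidth, the same feature can be plotted with different smoothing factors. A minimal sketch, assuming a recent seaborn release (0.11 or later) where kdeplot exposes the bw_adjust multiplier:
# smaller bw_adjust gives a wigglier curve, larger gives a smoother one
for bw in [0.25, 1.0, 3.0]:
    sns.kdeplot(x=data['total_phenols'], bw_adjust=bw, label='bw_adjust={}'.format(bw))
plt.legend()
plt.show()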
The wine data is now loaded and some exploration has yielded some useful insights. The next step is to import and instantiate some classification algorithms. From this list of algorithms we can choose the most appropriate and strongest performing to distinguish between wines.
It is not a good idea to evaluate a model with the same data used to build it. To see whether the model generalizes well (performs well on new data), it should be evaluated on data it has not seen. This can be achieved by splitting the dataset into two parts: a training set used to fit the model, and a test set held back for evaluation.
Within the scikit-learn library there is a function, train_test_split, that does exactly this. By default it places 75% of the samples (together with their labels) in the training set and the remaining 25% in the test set; here we pass test_size=0.2, so 80% of the data is used for training and 20% is held back as the test set.
from sklearn.model_selection import train_test_split
data_train, data_test, label_train, label_test = \
    train_test_split(wine['data'], wine['target'],
                     test_size=0.2)
print(len(data_train), ' samples in training data\n',
      len(data_test), ' samples in test data\n')
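Because the split is random, re-running the notebook produces a slightly different partition each time. If a reproducible split is wanted, a fixed random_state (and optionally stratification by class) can be passed; a purely illustrative variant:
# fixed seed for a reproducible split; stratify keeps the class proportions equal in both parts
data_train, data_test, label_train, label_test = train_test_split(
    wine['data'], wine['target'],
    test_size=0.2, random_state=42, stratify=wine['target'])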
It is difficult to know in advance which classification algorithm will perform the strongest. To make this problem more manageable, it is a good idea to try multiple algorithms and pick the one that performs best. This can be done by creating a dict of the scikit-learn classifiers we want to compare. A dict is a dictionary that is indexed by keys. Keys can be any immutable type, such as strings or numbers. Dicts can be thought of as key:value pairs, with the requirement that keys are always unique (within a dict).
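As a quick illustration of the key:value structure (this toy dict is not part of the analysis):
# a toy dict: string keys mapping to arbitrary values, looked up by key
toy = {'model_name': 'Naive Bayes', 'n_folds': 10}
print(toy['model_name'], toy['n_folds'])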
GridSearchCV is used in the following code block to find the optimal hyperparameters for each of the classification algorithms included in the dict. A hyperparameter is a parameter whose value is used to control the learning process. Some classifiers depend on one or more hyperparameters or regularization settings whose optimal values are not known ahead of time. GridSearchCV finds good values by trying every combination in the supplied parameter grid and cross-validating each one:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import RandomForestClassifier
dict_classifiers = {
    "Logistic Regression": {
        # the liblinear solver is specified because it supports both the l1 and l2 penalties
        'classifier': LogisticRegression(solver='liblinear'),
        'params': [
            {
                'penalty': ['l1', 'l2'],
                'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
            }
        ]
    },
    "Nearest Neighbors": {
        'classifier': KNeighborsClassifier(),
        'params': [
            {
                'n_neighbors': [1, 3, 5, 10],
                'leaf_size': [3, 30]
            }
        ]
    },
    "Linear SVM": {
        'classifier': SVC(),
        'params': [
            {
                'C': [1, 10, 100, 1000],
                'gamma': [0.001, 0.0001],
                'kernel': ['linear']
            }
        ]
    },
    "Gradient Boosting Classifier": {
        'classifier': GradientBoostingClassifier(),
        'params': [
            {
                'learning_rate': [0.05, 0.1],
                'n_estimators': [50, 100, 200],
                'max_depth': [3, None]
            }
        ]
    },
    "Decision Tree": {
        'classifier': tree.DecisionTreeClassifier(),
        'params': [
            {
                'max_depth': [3, None]
            }
        ]
    },
    "Random Forest": {
        'classifier': RandomForestClassifier(),
        'params': {}
    },
    "Naive Bayes": {
        'classifier': GaussianNB(),
        'params': {}
    }
}
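Before wiring everything together, here is a minimal sketch of how a single entry of this dict feeds into GridSearchCV, using the training split created above (the choice of entry is arbitrary):
from sklearn.model_selection import GridSearchCV
# grid-search one classifier on its own, as a preview of the batch function below
entry = dict_classifiers["Nearest Neighbors"]
grid = GridSearchCV(entry['classifier'], entry['params'], cv=10, scoring='accuracy')
grid.fit(data_train, label_train)
print(grid.best_params_, grid.best_score_)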
Learning curves give some information about a classifier's predictive power as a function of the number of training samples.
The learning_curve function in scikit-learn computes the training score (the accuracy of the model on the training data) along with the cross-validation score (the accuracy on data not included in the training set) for increasing training-set sizes.
An ideal model should capture most of the complexity of the training data (if it cannot, the model is under-fitting and a more flexible, lower-bias model may be needed), and the validation score should increase as more training data is added. This behavior indicates that the model will generalize well as more data is collected.
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.6, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        The plot title.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds.
        Specific cross-validation objects can be passed; see the
        sklearn.model_selection module for the list of possible objects.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # shaded bands show +/- one standard deviation across the CV folds
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
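As a quick check that the helper works, it can be called on a single classifier before running the full comparison (a sketch; the choice of classifier is arbitrary):
# learning curve for one un-tuned classifier
plot_learning_curve(GaussianNB(), "Naive Bayes", data_train, label_train,
                    ylim=(0.75, 1.0), cv=10)
plt.show()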
Next it is time to do the actual training of the models.
The 'batch_classify' function below compares the classifiers. For each classifier, an exhaustive grid search is performed in which each combination of hyperparameters is cross-validated using StratifiedKFold with 10 folds. This means the data is split into 10 sets, the model is trained on 9 of them and tested on the remaining one, and repeating this over all folds gives the cross-validation (CV) score. The grid search then selects the combination of parameters that gives the best CV score.
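The 10-fold scoring that the grid search performs internally can also be seen directly with cross_val_score; a minimal sketch (the choice of classifier here is arbitrary):
from sklearn.model_selection import cross_val_score
# ten scores, one per held-out fold; GridSearchCV averages these for every parameter combination
scores = cross_val_score(GaussianNB(), data_train, label_train, cv=10, scoring='accuracy')
print(scores.mean(), scores.std())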
Once the optimal parameters are found, each model is refit on the full training set. The accuracy of its predictions is then calculated on both the training set and the test set, to see how well the model generalizes to data it has never seen. The models are sorted by their test-set accuracy, which is the true test of a classifier on 'real world' data.
Lastly, the learning curves described above are plotted for each model.
import time
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

num_classifiers = len(dict_classifiers.keys())

def batch_classify(X_train, Y_train, X_test, Y_test, verbose=True):
    df_results = pd.DataFrame(
        data=np.zeros(shape=(num_classifiers, 4)),
        columns=['classifier',
                 'train_score',
                 'test_score',
                 'training_time'])
    count = 0
    for key, classifier in dict_classifiers.items():
        t_start = time.perf_counter()
        grid = GridSearchCV(classifier['classifier'],
                            classifier['params'],
                            refit=True,
                            cv=10,               # 10-fold CV: train on 9 folds, test on 1
                            scoring='accuracy',  # scoring metric
                            n_jobs=-1)
        estimator = grid.fit(X_train, Y_train)
        t_end = time.perf_counter()
        t_diff = t_end - t_start
        train_score = estimator.score(X_train, Y_train)
        test_score = estimator.score(X_test, Y_test)
        df_results.loc[count, 'classifier'] = key
        df_results.loc[count, 'train_score'] = train_score
        df_results.loc[count, 'test_score'] = test_score
        df_results.loc[count, 'training_time'] = t_diff
        if verbose:
            print("trained {c} in {f:.2f} s".format(c=key, f=t_diff))
        count += 1
        plot_learning_curve(estimator,
                            "{}".format(key),
                            X_train,
                            Y_train,
                            ylim=(0.75, 1.0),
                            cv=10)
    return df_results
df_results = batch_classify(data_train, label_train, data_test, label_test)
display(df_results.sort_values(by='test_score', ascending=False))
From the test scores, all of the classifiers perform reasonably well except for K Nearest Neighbors. Naive Bayes performs at almost the same level as the more complex models, and has the added benefit of running much faster.
The Random Forest, Linear SVM, and Gradient Boosting classifiers almost perfectly predict the training set, meaning they have low enough bias to capture the nuances of the data.
From the above analysis it seems that Naive Bayes, Random Forest, Gradient Boosting, and Linear SVM would all be adequate choices as 'real-world' models for wine classification, reliably predicting wine classes more than 95% of the time.
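As a final check, any one of these candidates can be fit on its own and evaluated on the held-out test set; a minimal sketch using Naive Bayes:
# fit the simplest well-performing candidate and score it on the held-out data
final_model = GaussianNB()
final_model.fit(data_train, label_train)
print('test accuracy: {:.3f}'.format(accuracy_score(label_test, final_model.predict(data_test))))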