Building 10 Classifier Models in Machine Learning +Notebook

A detailed process of building Classifier Models

Photo by Magda Ehlers on Pexels

In the last tutorial, we completed the Data Pre-Processing step, where we applied preprocessing techniques for transformation, variable selection, dimensionality reduction, and sampling for machine learning.

Now we can move on to the next steps within the Data Science process, applying the rest of the model-building process with various classification algorithms to understand what they are and how to use them for machine learning in Python. In a follow-up tutorial, we will discuss the Regression algorithms.

We will not go into detail about the algorithms. The purpose here is to understand the detailed process of building a Machine Learning model: training, model evaluation, and prediction.

Jupyter Notebook

See The Jupyter Notebook for the concepts we’ll cover on building machine learning models and my LinkedIn profile for other Data Science articles and tutorials.

Evaluating Performance

The metric chosen to evaluate model performance influences how performance is measured and compared across models created with different algorithms. We need a metric that measures performance solidly and coherently and that is comparable across the models analyzed. Let's use the same algorithm with different metrics and compare the results.

Metrics for Classification Algorithms

Accuracy is undoubtedly the most widely used metric. If we have a model with 85% accuracy, then out of every 100 predictions, the model gets 85 right. The cross_val_score() function will be used to evaluate performance. Keep in mind that if we have an unbalanced dataset, accuracy can be misleading.

# Logistic Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation 
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
model = LogisticRegression()
# Cross Validation - Scoring = 'accuracy'
result = cross_val_score(model, X, Y, cv = kfold, scoring = 'accuracy')
print("Accuracy: %.3f" % (result.mean() * 100))
# Result: 77.08%

Above, we create the model and check its accuracy. We import the modules, load the data, divide X (input) and Y (output), define some parameters, create the KFold splits used in cross-validation, create the logistic regression model, and print the result.

This is the same process repeated several times in the data preprocessing step. Therefore, for every 100 predictions, the model gets 77 right. Accuracy is the most straightforward metric of all, and we set the scoring parameter to 'accuracy'; that is, when we run the cross_val_score function, we can tell it which metric to use within cross-validation.

A single line of code can train the algorithm, test the model, and consider accuracy as a performance assessment metric. Cross-validation is undoubtedly an excellent technique for working with Machine Learning.

AUC (Binary Classification)

This setup lets us analyze the AUC — the area under the ROC curve. AUC is a performance metric for binary classification, where we can label the classes as positive and negative. Binary classification problems involve a trade-off between Sensitivity and Specificity:

  • Sensitivity is the rate of True Positive (TP). This is the number of positive instances of the first class that the model predicted correctly.
  • Specificity is the True Negative rate (TN). This is the number of instances of the second class that the model predicted correctly.

Values above 0.5 indicate the model predicts better than random guessing; the closer to 1.0, the better.

# AUC Curve
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating Logistic Regression model
model = LogisticRegression()
# Cross Validation - scoring ='roc_auc'
result = cross_val_score(model, X, Y, cv = kfold, scoring = 'roc_auc')
# Applying the average to the result
print("AUC: %.3f" % (result.mean() * 100))
# Result: 82.56%

We are working with the same model so that we have the same basis for comparison. The only difference in cross_val_score is that we changed the scoring metric from 'accuracy' to 'roc_auc.' We get an AUC of roughly 0.83, which indicates a model with good discriminative power.
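As a complement, the sketch below (an addition, not part of the original notebook) shows one way to plot the ROC curve behind this AUC score on a held-out test set, assuming the same X and Y arrays loaded above; roc_curve and roc_auc_score come from sklearn.metrics.

# Sketch: ROC curve for Logistic Regression on a held-out test set (assumes X and Y are loaded)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 7)
model = LogisticRegression()
model.fit(X_train, Y_train)
# The ROC curve needs the probability of the positive class
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(Y_test, probs)
print("AUC on the test set: %.3f" % roc_auc_score(Y_test, probs))
plt.plot(fpr, tpr, label = 'Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle = '--', label = 'Random guess')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend()
plt.show()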

Confusion Matrix

Both accuracy and the AUC curve take into account the confusion matrix, showing precisely the results of the predictions of our model.

# Confusion Matrix
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
test_size = 0.33
seed = 7
# Creating train and test datasets instead cross_val_score
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
# Creating model
model = LogisticRegression()
# Training model
model.fit(X_train, Y_train)
# Making predictions
predictions = model.predict(X_test)
matrix = confusion_matrix(Y_test, predictions)
# Printing confusion matrix
print(matrix)
# [[141  21]
#  [ 41  51]]
# 141 and 51 indicate correct predictions
# 21 and 41 indicate wrong predictions

Instead of using cross-validation, we use train_test_split from model_selection. We create the logistic regression model, train it, and make predictions with the test data. As output, we have the confusion matrix, which indicates that 141 and 51 are the model's correct answers and 21 and 41 are its errors; that is, the model is hitting more than missing.
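For reference, sensitivity and specificity can be read straight off this matrix. The lines below are a small illustrative addition (not part of the original code), assuming the matrix variable produced above; scikit-learn orders the rows and columns by sorted class label, so with classes 0 and 1 the layout is [[TN, FP], [FN, TP]].

# Sketch: deriving sensitivity, specificity, and accuracy from the confusion matrix above
tn, fp, fn, tp = matrix.ravel()  # sklearn layout: [[TN, FP], [FN, TP]]
sensitivity = tp / (tp + fn)     # true positive rate
specificity = tn / (tn + fp)     # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)
print("Sensitivity: %.3f" % sensitivity)
print("Specificity: %.3f" % specificity)
print("Accuracy: %.3f" % accuracy)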

Classification Report

We can use an alternative to print a classification report with multiple concurrent metrics instead of printing each metric individually:

# Classification Report
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
test_size = 0.33
seed = 7
# Creating train and test datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
# Creating model
model = LogisticRegression()
model.fit(X_train, Y_train)
# Making predictions
prediction = model.predict(X_test)
# Setting classification report
report = classification_report(Y_test, prediction)
# Printing report
print(report)

              precision    recall  f1-score   support
         0.0       0.77      0.87      0.82       162
         1.0       0.71      0.55      0.62        92
   micro avg       0.76      0.76      0.76       254
   macro avg       0.74      0.71      0.72       254
weighted avg       0.75      0.76      0.75       254

We load the data, divide it into X and Y, define the parameters, split it into training and testing subsets, train the model, test it, and finally call classification_report, passing Y_test (the labels we already know) and the model's predictions. From this comparison, the function calculates the final performance and delivers a report with several metrics.
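If we only need one of these numbers, scikit-learn also exposes the individual metrics behind the report. The lines below are a small illustrative addition, assuming the Y_test and prediction variables from the block above.

# Sketch: the individual metrics behind the classification report (positive class = 1)
from sklearn.metrics import precision_score, recall_score, f1_score
print("Precision (class 1): %.2f" % precision_score(Y_test, prediction))
print("Recall    (class 1): %.2f" % recall_score(Y_test, prediction))
print("F1-score  (class 1): %.2f" % f1_score(Y_test, prediction))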

We will hardly create just one version of the model; we will create up to 20 versions of a model until we reach a better result comparing performance metrics.

Creating Classification Models.

From now on, we'll see how to create various machine learning models for classification. Building the Machine Learning model is the most straightforward step in the process; pre-processing the data is far more laborious than creating the model itself. One of the first steps within a Machine Learning project is to define whether we are facing a Classification, Regression, or, eventually, unsupervised learning problem.

Algorithms differ between supervised and unsupervised learning, and within supervised learning there is also the difference between classification and regression algorithms. Everything we're doing so far refers to classification algorithms.

Classification Algorithms

We have no way of knowing which algorithm will work best for our dataset before testing it. We therefore test several algorithms and select the best model using a performance evaluation metric.

The ideal is to test several algorithms and then choose the one that provides the best level of accuracy. Let's try a set of classification algorithms under the same conditions.

1. Logistic Regression (Classification)

It is a Linear Algorithm that allows you to divide the data into two or more categories: the output classes according to the problem.

The Logistic Regression algorithm assumes a Normal Distribution for the numeric input variables and models binary classification problems.

# Logistic Regression Classifier
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
model = LogisticRegression()
# Standard Cross Validation - accuracy
results = cross_val_score(model, X, Y, cv = kfold)
# Print result
print("Accuracy: %.3f" % (results.mean() * 100))
# Result: 77.08%

The Logistic Regression algorithm has an accuracy of 77%. We use cross-validation with cross_val_score, which trains and tests the model directly. Because no scoring metric was specified, cross_val_score uses the estimator's default metric, which is accuracy.

2. Linear Discriminant Analysis (Binary Classification)

It is a Linear algorithm for binary classification. It also assumes that the data follows a Normal Distribution; that is, LDA and logistic regression expect to receive the data in a standardized format. We need to check this in exploratory analysis and apply data standardization to leave the features with a mean of 0 and a standard deviation of 1.
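As an illustration of that point, the sketch below (an addition, not part of the original code) shows one way to standardize the features with StandardScaler inside a Pipeline before LDA, assuming the X and Y arrays from the earlier blocks; the scaler is then refit on each cross-validation fold.

# Sketch: standardizing the data (mean 0, std 1) before LDA using a Pipeline (assumes X and Y are loaded)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
pipeline = Pipeline([
    ('scaler', StandardScaler()),          # standardizes each feature
    ('lda', LinearDiscriminantAnalysis())  # model trained on the scaled data
])
kfold = KFold(n_splits = 10, shuffle = True, random_state = 7)
result = cross_val_score(pipeline, X, Y, cv = kfold)
print("Standardized LDA Accuracy: %.3f" % (result.mean() * 100))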

# Linear Discriminant Analysis
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data in folds
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
model = LinearDiscriminantAnalysis()
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
# Print result
print("Linear Discriminant Analysis Accuracy: %.3f" % (result.mean() * 100))
# Result: 76.69%

The only difference between this code and the previous one is the change of algorithm. We apply the Linear Discriminant Analysis algorithm, which has a slightly lower accuracy than the Logistic Regression model.

3. KNN (K-Nearest Neighbors)

It is a nonlinear algorithm that uses a distance metric to find the k instances of the training dataset closest to a new data point. KNN calculates the Euclidean distance to each training point and uses the k nearest neighbors to make predictions for new data.

# KNN
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
model = KNeighborsClassifier()
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
print("KNN K-Nearest Neighbor Accuracy: %.3f" % (result.mean() * 100))
#Result: 76.69%

We changed the algorithm again. This time we import the KNeighborsClassifier from the neighbors package. We achieved performance similar to the previous LDA algorithm.
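KNeighborsClassifier defaults to k = 5. As an optional experiment (not in the original code), the sketch below tries a few values of k under the same cross-validation setup, assuming X and Y are already loaded; the best k will depend on the data.

# Sketch: comparing a few values of k with the same cross-validation setup
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
kfold = KFold(n_splits = 10, shuffle = True, random_state = 7)
for k in [3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors = k)
    result = cross_val_score(model, X, Y, cv = kfold)
    print("k = %2d -> Accuracy: %.3f" % (k, result.mean() * 100))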

4. Naive Bayes (Probabilistic Algorithm)

Another nonlinear algorithm. Naive Bayes is a very famous probabilistic algorithm. It calculates the probability of each class and the conditional probability of each feature given the class in order to classify the data. The GaussianNB implementation assumes the data follows a Gaussian (Normal) distribution.

# Naive Bayes 
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
model = GaussianNB()
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
print("Naive Bayes Accuracy: %.3f" % (result.mean() * 100))
# Result: 75.91%

5. CART (Classification and Regression Trees)

It is a Non-Linear algorithm that builds a binary tree from the training dataset. Each feature and each candidate split value are evaluated in order to minimize a cost function.

# CART (Classification and Regression Tree)
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating DecisionTree model
model = DecisionTreeClassifier()
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
# CART Accuracy
print("Accuracy: %.3f" % (result.mean() * 100))
# Result: 69.80%

6. SVM — Support Vector Machines

It's one of the most powerful machine learning algorithms there is. The SVM takes data that is not linearly separable, projects it into a higher-dimensional space, and finds a separating boundary there. To do this, we use the SVC class from the svm package.

# Support Vector Machine
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating SVC model
model = SVC()
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
# Result Support Vector Machine Classifier
print("Accuracy: %.3f" % (result.mean() * 100))
# Result: 65.11%

Although SVM is one of the best algorithms in existence, it had the worst accuracy so far. However, the more complex the algorithm, the more sensitive it is to data preprocessing — and no preprocessing has been done here. Therefore, the SVM's poor performance is understandable.
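To illustrate that sensitivity, the sketch below (an addition, with no guarantee about the resulting score) evaluates the same SVC after standardizing the features with a Pipeline, assuming the X and Y arrays loaded above.

# Sketch: the same SVC evaluated after standardizing the features (results will vary)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # puts every feature on the same scale
    ('svc', SVC())
])
kfold = KFold(n_splits = 10, shuffle = True, random_state = 7)
result = cross_val_score(pipeline, X, Y, cv = kfold)
print("Scaled SVC Accuracy: %.3f" % (result.mean() * 100))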

What have we done?

What we’ve done so far, in practice, from one model to another, is change the algorithm used. The rest was pretty much the same thing.

We can automate this work of experimenting with and testing several different algorithms through code; after all, Python offers many features to automate this kind of work.

Selecting the best predictive model.

Now we’ll create a piece of code that will execute everything we’ve done above, but in an automated way. Next, we will compare all the models and choose the one with the best performance. In this way, we automate our work and quickly experiment and test several different algorithms.

# Importing all algorithms used
from pandas import read_csv
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Creating empty list 
models = []
# Machine Learning Algorithms list
models.append(('LR', LogisticRegression())) #binary class
models.append(('LDA', LinearDiscriminantAnalysis())) #binary class
models.append(('NB', GaussianNB())) # binary class
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC()))
# Evaluating each model in a loop
results = [] # result list
names = [] # names list
for name, model in models:
    kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
    cv_results = cross_val_score(model, X, Y, cv = kfold, scoring = 'accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# Boxplot results
fig = plt.figure()
fig.suptitle('Comparison of Classification Algorithms')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

We created two lists — a list for results and another list for the names of machine learning algorithms.

We then create a for loop to iterate over the list of models. For each item in the list, KFold generates the cross-validation splits, cross_val_score trains and evaluates the model, and append adds the result to the results list alongside the model's name. The loop repeats this for each algorithm.

With the code above, we trained and tested six different algorithms. The LR and LDA algorithms are the ones that presented the best performance through accuracy above 76%.

We evaluate the models by putting the results into boxplots. The line inside each boxplot marks the median accuracy, and on the x-axis we have the names of the algorithms.

Building machine learning models unites knowledge from various areas. We combine computer programming, machine learning, data pre-processing, modeling, and business understanding — all to create multiple models, compare their performances, and then select the best model. Now we can optimize our machine learning model.

Model Optimization

After we create the model, we can still try to optimize it by adjusting the hyperparameters.

We are working on a binary classification problem; that is, we have two possible outputs. For binary classification, Logistic Regression, Linear Discriminant Analysis, and Naive Bayes are, in general, the best options.

On the other hand, we might work with multi-class classification, that is, multiple output classes. Rather than predicting whether or not a person will have diabetes, we could try to predict the disease's category — early stage, intermediate, advanced, or not developing the disease at all — creating 4 or 5 different classes. In that case, an SVM or a neural network could be a more interesting option.

Once one of the models has been selected, we can proceed to model optimization or hyperparameter adjustment. All machine learning algorithms are parameterized, which means that we can adjust the predictive model’s performance by tuning the parameters, i.e., fine-tuning the parameters.

Our job is to find the best combination of parameters in each machine learning algorithm. This process is also called Hyperparameter Optimization, and scikit-learn offers two methods for automatic parameter optimization:

  • Grid Search Parameter Tuning
  • Random Search Parameter Tuning.

Each Machine Learning algorithm comes with a set of parameters that we can technically call hyperparameters, which are ways to change the algorithm’s behavior — the problem is that we don’t know the best combination of hyperparameters for each dataset and each business problem.

With that in mind, the scikit-learn developers created automatic options to test various combinations of hyperparameters. Through this testing process, we can find the combination of hyperparameters that best fits the business problem and dataset we're working on. It is worth running this procedure after choosing one of the models.

Two main methods for optimizing models

We will use the Logistic Regression algorithm to illustrate the process; in practice, its performance was almost identical to that of the Linear Discriminant Analysis model.

Because logistic regression offers more hyperparameters and is easier to adjust than LDA, let's use it. In the end, the choice rests exclusively with the Data Scientist; the tools serve only as support, and the final decision about how the analysis will proceed is ours. At this point, we have several models created, and we choose which one to keep working on.

1. Grid Search Parameter Tuning

This method methodically performs combinations between all algorithm parameters, creating a grid, a table of combinations. This table will test several hyperparameters, but we have to indicate which parameter values we want for the grid search to try the combination.

We’ll import the GridSearchCV function from the model_selection and the LogisticRegression algorithm, load the data and divide it into X and Y.

Then we create the grid, which is a dictionary of key: value pairs. The 'penalty' key names one hyperparameter and maps to the list of values ['l1', 'l2']. The other key, 'C', names a second hyperparameter and is followed by a list of values to test for it.

To know the names of these hyperparameters, we must check the documentation of the algorithm we are working with — in this case, LogisticRegression. Each algorithm has its own specific set of hyperparameters.

(Image: scikit-learn LogisticRegression documentation listing the default hyperparameter values)

The values shown above are the defaults scikit-learn uses for each parameter of the algorithm. In logistic regression, the penalty regularizes the model to prevent overfitting, while "C" is the inverse of the regularization strength; that is, it is directly linked to the penalty.

# Grid Search Parameter Tuning
from pandas import read_csv
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
# Loading Data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting Hyperparameters
grid_values = {'penalty': ['l1', 'l2'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
# Creating model
model = LogisticRegression()
# Creating grid
grid = GridSearchCV(estimator = model, param_grid = grid_values)
# Training grid
grid.fit(X, Y)
print("Accuracy: %.3f" % (grid.best_score_ * 100))
print("Best Model Parameters:", grid.best_estimator_)
# Result: 77.08%
# Best Model Parameters: LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=None, penalty='l1', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)

Therefore, we instantiate the LogisticRegression model, call the GridSearchCV function, tell it to use the model we have just created, and pass the grid_values dictionary (with penalty and C) as param_grid.

After that, we feed GridSearchCV with the X and Y sets by calling fit. Running the cell returns an accuracy of 77% along with the best combination of hyperparameters found by the GridSearchCV optimization.
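If we want to see every combination that was tried, not just the winner, GridSearchCV stores them in its cv_results_ attribute. The snippet below is a small illustrative addition, assuming the grid object fitted above.

# Sketch: inspecting every combination tested by GridSearchCV
for params, score in zip(grid.cv_results_['params'], grid.cv_results_['mean_test_score']):
    print("%s -> mean accuracy %.3f" % (params, score * 100))
print("Best combination:", grid.best_params_)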

2. Random Search Parameter Tuning

Another option is the Random Search Parameter Tuning method, which samples algorithm parameters from a uniform random distribution for a fixed number of iterations. A model is constructed and tested for each sampled combination; that is, this method searches the parameter combinations randomly.

Because it evaluates only a random sample of the combinations, RandomizedSearchCV is often faster than the exhaustive GridSearchCV on large grids, though it is not guaranteed to test the single best combination.

# Random Search Parameter Tuning
from pandas import read_csv
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
seed = 7
iterations = 14
grid_values = {'penalty': ['l1', 'l2'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
# Creating model
model = LogisticRegression()
# Creating grid for RandomizedSearch
rsearch = RandomizedSearchCV(estimator = model,
                             param_distributions = grid_values,
                             n_iter = iterations,
                             random_state = seed)
# Training Randomized Search
rsearch.fit(X, Y)
print("Accuracy: %.3f" % (rsearch.best_score_ * 100))
print("Best Model Parameters:", rsearch.best_estimator_)
# result 77.08%
# Best Model Parameters: LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn',n_jobs=None, penalty='l1', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)

When we call the RandomizedSearchCV function, we have a few more parameters to specify: the model, the grid of values, and the number of iterations (more iterations can improve the result but consume more time and computational resources); then we call fit with X and Y.

When running, RandomizedSearchCV reaches 77% accuracy, equal to the accuracy of GridSearchCV, and it found the same hyperparameters. For a grid this small, GridSearch remains a good choice: it performs well and checks every combination without taking much extra time.

Save and load the trained model.

So far, we've seen model optimization and hyperparameter tuning. Now we'll see how to save the result of our work, i.e., save the trained model to disk. If we close the Notebook, everything we ran would have to be run again; depending on the model, it could take days to train it again.

At some point, we will also use this model to make new predictions by presenting new data to it. We'll then need to load the saved model from disk and perform our predictions. Saving the model and loading it back are two important actions within the entire process.

We will use the pickle package, which allows us to save the model in a specific binary format. We save in pickle format and then load in the same format.

# Saving the result
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pickle
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
test_size = 0.33
seed = 7
# Creating train and test datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
# Creating model
model = LogisticRegression()
# Training model
model.fit(X_train, Y_train)
# Saving the model in the documents directory
save_file = 'documents/final_classifier.sav'
pickle.dump(model, open(save_file, 'wb'))
print("Model saved.")
# Loading the saved model with pickle.load
final_classifier = pickle.load(open(save_file, 'rb'))
model_prod = final_classifier.score(X_test, Y_test)
print("Model loaded.")
print("Accuracy: %.3f" % (model_prod * 100))
# Result 75.59%

Therefore, we load the data, divide it into X and Y, split the dataset into training and testing with train_test_split, create the logistic regression model, train the model with fit, and finally save the model.

We save the model in the documents directory under the name final_classifier, using pickle.dump to dump the model's contents and write them to the file referenced by the save_file object, opened in 'wb' (write binary) mode. Model saved.

Then, to load the model, we use the pickle.load method, open the file with the open function, and point it to the file we saved as save_file.
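Once loaded, the model can score new, unseen records. The sketch below is an illustrative addition; the feature values are example numbers in the same order as the columns list, not real patient data.

# Sketch: using the loaded model to classify a new, unseen record (example values only)
import numpy as np
new_patient = np.array([[6, 148, 72, 35, 0, 33.6, 0.627, 50]])  # preg, plas, pres, skin, test, mass, pedi, age
prediction = final_classifier.predict(new_patient)
print("Predicted class:", prediction[0])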

Optimizing Performance with Ensemble Methods

So far, we have experimented with several individual algorithms. Now we will try to make two or more algorithms work together in a single package.

The Ensemble method is a set of algorithms that works as if it were a single package, taking several different algorithms and achieving better performance. We have three main categories for these methods:

Bagging: builds multiple models (typically of the same type) from different random subsets of the training dataset, and the predictions of the models trained on each sample are combined.

Boosting: builds multiple models (typically of the same type), where each model learns to correct the errors generated by the previous model in the sequence. This category often performs well: during training, the algorithm measures the error of each model it creates and uses that error to train the next model — the models are of the same type, but each one learns from the mistakes of the previous one.

Voting: builds multiple models (usually of different types) and uses simple statistics (such as the average) to combine their predictions. It is a voting system; that is, we take several completely different algorithms, run them in parallel, and at the end a vote decides the prediction.

Using the ensemble method is no guarantee that we will achieve greater accuracy.

We should devote ourselves to pre-processing the data regardless of the algorithm we are dealing with.

If we can't reach the accuracy target defined for the project with individual algorithms, we need to experiment with other machine learning models and architectures until we beat the target, or even conclude that we can't achieve the desired accuracy with the data we have.

7. Bagged Decision Tree

It is a method that works well when the base model has high variance. We need to import the BaggingClassifier algorithm from the ensemble package and the DecisionTreeClassifier from the tree package.

We load the data, divide it into X and Y, define parameters, create folds for cross-validation, and create a Machine Learning model: in this case a decision tree, the DecisionTreeClassifier.

# Bagged Decision Tree
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# Loading Data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
cart = DecisionTreeClassifier()
num_trees = 100
model = BaggingClassifier(base_estimator = cart,
                          n_estimators = num_trees,
                          random_state = seed)
result = cross_val_score(model, X, Y, cv = kfold)
print("Bagged Decision Tree Accuracy: %.3f" % (result.mean() * 100))
# Result 75.91%

Next, we define how many trees will be created and feed them to the BaggingClassifier. The DecisionTreeClassifier model is considered a weak classifier on its own. The BaggingClassifier receives the cart estimator, builds 100 of them via n_estimators, and uses the seed to reproduce the same results.

After that, we apply cross-validation and perform the training, reaching an accuracy of 75.9%. Comparing against the individual algorithms, the BaggingClassifier surpassed SVM and CART and essentially matched KNN and Naive Bayes, considering that we had not done any specific pre-processing.

Therefore, simply by using the ensemble method we matched or surpassed four of the six individual algorithms. If we pre-process the data and adjust the hyperparameters, the ensemble method will quite likely exceed 80% accuracy with the BaggingClassifier.

8. Random Forest Classifier

This algorithm is an excellent choice for selecting variables. Random Forest is an extension of the Bagging Decision Tree.

In practice, Random Forest falls into the Bagging category, where we put several weak classifiers to work together, forming a more robust model.

# Random Forest Classifier
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Loading data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Setting trees
num_trees = 100
max_features = 3
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating Model
model = RandomForestClassifier(n_estimators = num_trees,
                               max_features = max_features)
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
# Print Result
print("Random Forst Accuracy: %.3f" % (result.mean() * 100))
# Result 77.34%

Therefore, we import the RandomForestClassifier algorithm from the ensemble package, load the data, divide X and Y, define the parameters and cross-validation folds, create the RandomForestClassifier model, and finally cross-validate. We get 77.3% accuracy, slightly surpassing Logistic Regression and LDA.

9. AdaBoost

This algorithm is based on the Boosting ensemble category. The RandomForestClassifier and BaggingClassifier algorithms are in the Bagging category, while AdaBoost is in the Boosting category.

Boosting algorithms create a sequence of models that attempt to correct their errors based on previous models within the sequence. Once created, the models make predictions that can receive a weight according to their accuracy, and the results are combined to create a single final prediction.

The Boosting category is one of the most amazing; it uses each model's error to improve the next model, providing even better performance — machine learning that learns from its own mistakes.

AdaBoost assigns weights to the instances in the dataset, defining how easy or difficult they are for the classification process, allowing the algorithm to pay more or less attention to the instances during the model construction process.

# AdaBoost Classifier Model
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
# Loading data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Setting trees
num_trees = 30
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
model = AdaBoostClassifier(n_estimators = num_trees,
                           random_state = seed)
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
# Print Result
print("AdaBoost Classifier Accuracy: %.3f" % (result.mean() * 100))
# Result 75.52%

We have a 75% accuracy, slightly below the previous models, since we need to work better with pre-processing this data.

10. Stochastic Gradient Boosting

Stochastic Gradient Boosting is also one of the most sophisticated Ensemble methods, built on quite advanced mathematical techniques.

# Stochastic Gradient Boosting
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
# Loading data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Setting number of tress
num_trees = 100
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
model = GradientBoostingClassifier(n_estimators = num_trees, random_state = seed)
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
# Print result
print("Gradient Boosting Classifie Accuracy: %.3f" % (result.mean() * 100))
# Result 75.91%

What we do differently is basically change the model. We apply GradientBoostingClassifier, perform cross-validation, and reach the same level as the individual algorithms, 75.9%, without any further data processing.

11. Voting Ensemble

Bagging and Boosting combine models of the same type (in our examples, they are all decision trees), while voting ensembles allow you to create multiple models of different types. In the previous ensemble models we worked with models of the same type; from now on, we will combine completely different algorithms.

This is one of the simplest Ensemble methods. It builds two or more separate models from the training dataset. The VotingClassifier then combines the predictions of each sub-model to make predictions on new data. The predictions of each sub-model can receive weights, defined manually or through heuristics. There are more advanced versions of voting in which a model learns the best weight to assign to each sub-model; this is called stacking (stacked generalization), which was not yet available in scikit-learn at the time of writing.

# Voting Ensemble
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
# Loading data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating empty list
estimators = []
# Logistic Regression Model
model1 = LogisticRegression()
estimators.append(('logistic', model1))
# Decision Tree Classifier Model
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
# SVC Model
model3 = SVC()
estimators.append(('svm', model3))
# Creating ensemble model passing the list into VotingClassifier
ensemble = VotingClassifier(estimators)
# Cross Validation
result = cross_val_score(ensemble, X, Y, cv = kfold)
# Print Result
print("Voting Ensemble Accuracy: %.3f" % (result.mean() * 100))
# Result 74.74%

For this case, we need the VotingClassifier from the ensemble package plus the other algorithms that will take part in the voting. We do the standard import process, separate the data into X and Y, and specify the parameters and folds.

This time, however, we create an empty list of estimators and fill it with append: we create each model, attach it to the list, feed the list to the VotingClassifier, cross-validate, and get back an accuracy of 74.7%, slightly lower than before.

We have to keep in mind that there are now 3 models to work with. Optimizing even a single model is already complex, so in practice this ensemble would need much more data pre-processing to perform better. The Ensemble method is no guarantee of performance, but it can be worth trying when we can't reach the target with individual algorithms.
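The text above mentioned that sub-models can receive weights. As an optional variation (not in the original code), the sketch below uses soft voting with manually chosen weights, assuming the same X, Y, and kfold objects from the block above; the SVC needs probability=True so it can contribute class probabilities, and the weights here are arbitrary examples.

# Sketch: weighted soft voting with the same three sub-models (illustrative weights)
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
estimators = [('logistic', LogisticRegression()),
              ('cart', DecisionTreeClassifier()),
              ('svm', SVC(probability = True))]
# Give logistic regression twice the weight of the other two sub-models
ensemble = VotingClassifier(estimators, voting = 'soft', weights = [2, 1, 1])
result = cross_val_score(ensemble, X, Y, cv = kfold)
print("Weighted Voting Accuracy: %.3f" % (result.mean() * 100))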

12. XGBoost

This algorithm is something of a secret weapon of the winners of Kaggle's Data Science competitions. It is an extension of the Gradient Boosting (GBM) algorithm that supports multithreading (running the algorithm in parallel) on a single machine, as well as parallel processing across a multi-server cluster.

The main advantage of XGBoost over GBM is its ability to manage sparse data. XGBoost automatically accepts sparse data as input without storing zeros in memory.

1- Accept sparse data (which allows working with sparse matrices) without converting to dense matrices.

2- Builds a learning tree using a modern split method (called quantile sketch), which results in a much shorter processing time than traditional methods.

3- Allows parallel computing on a single machine (through multithreading) and parallel processing on clustered distributed machines.

XGBoost uses the same parameters as GBM and allows advanced handling of missing data. It improves several aspects of Gradient Boosting so that it runs faster without losing accuracy, fixing some of the ensemble's practical problems, especially concerning multithreading. XGBoost can run in parallel, using multiple CPU threads, or even parallelize the entire execution on the GPU, without losing its main advantage of high accuracy. It is widely used by Data Scientists who win Kaggle competitions. GitHub repository: https://github.com/dmlc/XGBoost

Install XGBoost from PyPI

Run the installation directly from the command line or from a notebook cell. It requires NumPy and SciPy as prerequisites.

!pip install xgboost

XGBoost Classifier

# XGBoost Classifier
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
# Loading data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
test_size = 0.33
seed = 7
# Creating train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
# Creating model
model = XGBClassifier()
# Training model
model.fit(X_train, y_train)
# Print model
print(model)
# Making predictions
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# Evaluating predictions
accuracy = accuracy_score(y_test, predictions)
print("XGBoost Accuracy: %.2f%%" % (accuracy * 100.0))
# result 77.95%

This time we use train_test_split instead of cross_val_score because of the higher processing load cross-validation would require. In the end, we have a model created with all its hyperparameters listed, reaching an accuracy of 77.95% — the highest accuracy of all the algorithms built so far, under the same conditions: no processing, normalization, standardization, or special treatment of the data.
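XGBClassifier exposes many hyperparameters that we left at their defaults. The sketch below is an illustrative addition showing a few common ones with arbitrary, untuned values; it assumes the X_train, X_test, y_train, and y_test sets created above.

# Sketch: a few common XGBClassifier settings (illustrative, untuned values)
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
model = XGBClassifier(n_estimators = 100,    # number of boosting rounds
                      max_depth = 3,         # depth of each tree
                      learning_rate = 0.1,   # shrinkage applied to each tree's contribution
                      n_jobs = -1)           # use all available CPU cores (multithreading)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("XGBoost Accuracy: %.2f%%" % (accuracy_score(y_test, y_pred) * 100))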

We did not go into detail about the algorithms. The purpose here was to understand the process of building the model: pre-processing, training, model evaluation, and prediction.

And there we have it. I hope you have found this useful. Thank you for reading.


Building 10 Classifier ??Models in Machine Learning +Notebook was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.


This content originally appeared on Level Up Coding - Medium and was authored by Anello

A detailed process of building Classifier Models

Foto de Magda Ehlers no Pexels

In the last tutorial, we completed the Data Pre-Processing step. We saw preprocessing techniques applied in transformation and variable selection, dimensionality reduction, and sampling for machine learning throughout this previous tutorial.

Now we can move on to the next steps within the Data Science process, where we’ll apply the rest of the model building process with various classification algorithms to understand what it is and how to use machine learning with python language. In the next moment, we will discuss the Regression algorithms.

We will not go into detail about the algorithms. The purpose here will be to understand the detailed process of building the Machine Learning model, machine learning, model evaluation, and prediction scans.

Jupyter Notebook

See The Jupyter Notebook for the concepts we’ll cover on building machine learning models and my LinkedIn profile for other Data Science articles and tutorials.

Evaluating Performance

The metrics chosen to evaluate model performance will influence how performance is measured and compared to models created with other algorithms. We need to find a metric to measure performance between models solidly and coherently, a metric comparable to the models analyzed. Let’s use the same algorithm, but with different metrics, and so compare the results.

Metrics for Classification Algorithms

Accuracy is undoubtedly the most widely used metric. If we have a model with 85% accuracy, every 100 predictions, a model hits correctly 85 of the time. The function cross_val_score() will be used to evaluate performance. If we have an unbalanced dataset, the accuracy may fail.

# Logistic Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation 
kfold = KFold(num_folds, True, random_state = seed)
# Creating model
model = LogisticRegression()
# Cross Validation - Scoring = 'accuracy'
result = cross_val_score(model, X, Y, cv = kfold, scoring = 'accuracy')
print("Accuracy: %.3f" % (result.mean() * 100))
# Result: 77.08%

Above, we create the model and check its accuracy. We import the modules, load the data, divide X (input)and Y(output), define some parameters, create KFold — the divisions we use in cross-validation, create the logistic regression model, and print the result.

This was the same process repeated several times in the data preprocessing step. Therefore, for every 100 predictions, the model hits 77 of them. Accuracy is the most straightforward metric of all, and we set the scoring parameter as accuracy; that is, when we run the cross_val_score function, we can tell which metrics we want to use within cross-validation.

A single line of code can train the algorithm, test the model, and consider accuracy as a performance assessment metric. Cross-validation is undoubtedly an excellent technique for working with Machine Learning.

AUC (Binary Classification)

Allows you to analyze the AUC — area under the curve metric. AUC is a performance metric for binary classification, where we can set classes to positive and negative. Binary classification problems are a trade-off between Sensitivity and Specificity:

  • Sensitivity is the rate of True Positive (TP). This is the number of positive instances of the first class that the model predicted correctly.
  • Specificity is the True Negative rate (TN). This is the number of instances of the second class that the model predicted correctly.

Values above 0.5 indicate a good prediction rate.

# AUC Curve
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(num_folds, True, random_state = seed)
# Creating Logistic Regression model
model = LogisticRegression()
# Cross Validation - scoring ='roc_auc'
result = cross_val_score(model, X, Y, cv = kfold, scoring = 'roc_auc')
# Apllying the average for the result
print("AUC: %.3f" % (result.mean() * 100))
# Result: 82.56%

We are working with the same model so that we have the same parameter of comparison. The only difference we have in cross_val_score is that we changed the accuracy scoring ‘accuracy’ metric to ‘roc_auc.’ We have an 82% accuracy rate, a model with a high accuracy rate.

Confusion Matrix

Both accuracy and the AUC curve take into account the confusion matrix, showing precisely the results of the predictions of our model.

# Confusion Matrix
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# Loading Data
data = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
test_size = 0.33
seed = 7
# Creating train and test datasets instead cross_val_score
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
# Creating model
model = LogisticRegression()
# Training model
model.fit(X_train, Y_train)
# Making predictions
predictions = model.predict(X_test)
matrix = confusion_matrix(Y_test, predictions)
# Printing confusion matrix
print(matrix)
# [[141  21]
# [ 41 51]]
# 151, 51 indicates correct prediction
# 21, 41 indicates wrong predictions

Instead of using cross-validation, we use train_test_split from model_selection. We create the logistic regression model, do the training, and end the predictions with the test data. As output, we have the confusion matrix that indicates that 141 and 51 are the correct answers of the model and 21 and 41 are the model’s errors; that is, the model is hitting more than missing.

Classification Report

We can use an alternative to print a classification report with multiple concurrent metrics instead of printing each metric individually:

# Classification Report
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
test_size = 0.33
seed = 7
# Creating train and test datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
# Creating model
model = LogisticRegression()
model.fit(X_train, Y_train)
# Making predictions
prediction = model.predict(X_test)
# Setting classification report
report = classification_report(Y_test, prediction)
# Printing report
print(report)

precision recall f1-score support
0.0 0.77 0.87 0.82 162
1.0 0.71 0.55 0.62 92
micro avg 0.76 0.76 0.76 254
macro avg 0.74 0.71 0.72 254
weighted avg 0.75 0.76 0.75 254

We load the data, divide it into X and Y, define the parameters, split it into training and testing subsets, train the model, test it and finally call the classification_report — we present the Y test (what’s already in our data set). We take the predictions of our model and compare it with the data that we already know the result. With this, we calculate the final performance and deliver the performance report with other metrics.

We will hardly create just one version of the model; we will create up to 20 versions of a model until we reach a better result comparing performance metrics.

Creating Classification Models.

From now on, we’ll see how to create various machine learning models for classification. Building the Machine Learning model is the most straightforward step in the process; already working with pre-processing is infinitely more laborious than creating the model itself. One of the first steps within the Machine Learning project is to define whether we are facing a Classification, Regression, or, eventually, unsupervised learning problem.

Algorithms have a difference concerning supervised and unsupervised learning, and besides, there is also the difference within supervised learning in classification and regression algorithms. Everything we’re doing so far refers to sorting algorithms.

Classification Algorithms

We have no way of knowing which algorithm will work best to construct the model before testing the algorithm with our dataset. A comparison metric will test several algorithms and select the best model, a performance evaluation metric.

The ideal is to test some algorithms and then choose the one that provides the best level of accuracy. Let’s try a set of sorting algorithms under the same conditions.

1. Logistic Regression (Classification)

It is a Linear Algorithm that allows you to divide the data into two or more categories: the output classes according to the problem.

The Logistic Regression algorithm assumes that the data is in a Normal Distribution for numeric values that the algorithm can model with binary classification.

# Logistic Regression Classifier
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(num_folds, True, random_state = seed)
# Creating model
model = LogisticRegression()
# Standard Cross Validation - accuracy
results = cross_val_score(model, X, Y, cv = kfold)
# Print result
print("Accuracy: %.3f" % (results.mean() * 100))
# Result: 77.08%

The Logistic Regression algorithm has an accuracy of 77%. We use cross-validation with cross_val_score, which trains and tests the model directly. Because no scoring metric was specified, cross_val_score falls back to its default metric, which for classifiers is accuracy.

2. Linear Discriminant Analysis (Binary Classification)

It is a linear algorithm for binary classification. It also assumes that the data follows a Normal Distribution; that is, LDA and Logistic Regression expect the data in a standardized format. We need to check this in the exploratory analysis and apply standardization to leave the variables with a mean of 0 and a standard deviation of 1.

# Linear Discriminant Analysis
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data in folds
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
model = LinearDiscriminantAnalysis()
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
# Print result
print("Linear Discriminant Analysis Accuracy: %.3f" % (result.mean() * 100))
# Result: 76.69%

The only difference between this code and the previous one is the change of algorithm. We apply the Linear Discriminant Analysis algorithm, which has a slightly lower accuracy than the Logistic Regression model.

3. KNN (K-Nearest Neighbors)

It is a nonlinear algorithm that uses a distance metric to find the k nearest instances in the training dataset. For each new data point, KNN computes the Euclidean distance to every training point and makes the prediction from the closest k neighbors.

# KNN
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
random_state = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = random_state)
# Creating model
model = KNeighborsClassifier()
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
print("KNN K-Nearest Neighbor Accuracy: %.3f" % (result.mean() * 100))
#Result: 76.69%

We changed the algorithm again. This time we import KNeighborsClassifier, which belongs to the neighbors package. We achieved performance similar to the previous LDA algorithm.

4. Naive Bayes (Probabilistic Algorithm)

Another nonlinear algorithm. Naive Bayes is a very well-known probabilistic algorithm. It calculates the prior probability of each class and the conditional probability of each feature given the class in order to classify the data. The GaussianNB implementation assumes the features follow a Gaussian (Normal) distribution.

# Naive Bayes 
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
model = GaussianNB()
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
print("Naive Bayes Accuracy: %.3f" % (result.mean() * 100))
# Result: 75.91%

5. CART (Classification and Regression Trees)

It is a nonlinear algorithm that builds a binary tree from the training dataset. Each feature and each candidate split value are evaluated in order to minimize a cost function.

# CART (Classification and Regression Tree)
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating DecisionTree model
model = DecisionTreeClassifier()
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
# CART Accuracy
print("Accuracy: %.3f" % (result.mean() * 100))
# Result: 69.80%

6. SVM — Support Vector Machines

It’s one of the most fascinating machine learning algorithms there is. The SVM takes data that is not linearly separable, projects it into a higher-dimensional space, and separates the classes there. To do this, we use the SVC class from the svm package.

# Support Vector Machine
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating SVC model
model = SVC()
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
# Result Support Vector Machine Classifier
print("Accuracy: %.3f" % (result.mean() * 100))
# Result: 65.11%

Although SVM is one of the best algorithms in existence, it had the worst accuracy so far. However, the more complex the algorithm, the more sensitive it is to data pre-processing, and no pre-processing has been done here. Therefore, the SVM's poor performance is understandable.
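As an illustrative aside (not part of the original notebook), here is a minimal sketch of what that pre-processing could look like: wrapping SVC in a scikit-learn Pipeline with StandardScaler, so the standardization is fitted only on the training folds at each cross-validation split. The resulting accuracy will depend on your data and scikit-learn version.

# Hedged sketch: standardizing the data before SVC inside a Pipeline
from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
X = array[:,0:8]
Y = array[:,8]
# The scaler is fitted only on the training folds at each split
pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
kfold = KFold(n_splits = 10, shuffle = True, random_state = 7)
result = cross_val_score(pipeline, X, Y, cv = kfold)
print("SVC with scaling - Accuracy: %.3f" % (result.mean() * 100))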

What have we done?

What we’ve done so far, in practice, from one model to another, is change the algorithm used. The rest was pretty much the same thing.

We can automate this work of experimenting with and testing several different algorithms through the programming language; after all, Python offers several features to automate it.

Selecting the best predictive model.

Now we’ll create a piece of code that will execute everything we’ve done above, but in an automated way. Next, we will compare all the models and choose the one with the best performance. In this way, we automate our work and quickly experiment and test several different algorithms.

# Importing all algorithms used
from pandas import read_csv
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Creating empty list 
models = []
# Machine Learning Algorithms list
models.append(('LR', LogisticRegression())) #binary class
models.append(('LDA', LinearDiscriminantAnalysis())) #binary class
models.append(('NB', GaussianNB())) # binary class
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC()))
# Evaluating each model in a loop
results = [] # result list
names = [] # names list
for name, model in models:
    kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
    cv_results = cross_val_score(model, X, Y, cv = kfold, scoring = 'accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# Boxplot results
fig = plt.figure()
fig.suptitle('Comparison of Classification Algorithms')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

We created two lists — a list for results and another list for the names of machine learning algorithms.

We then create a for loop to iterate over the list of models. For each item in the list, KFold splits the data for cross-validation, cross_val_score is run, and the result is appended, together with the model's name, to the lists created above. The loop repeats this for each algorithm.

With the code above, we trained and tested six different algorithms. The LR and LDA algorithms are the ones that presented the best performance through accuracy above 76%.

We evaluate the models by putting the results in boxplots. The line inside each boxplot represents the median accuracy, and on the x-axis we have the names of the algorithms.

Building machine learning models means uniting knowledge from various areas. We combine computer programming, machine learning, data pre-processing, modeling, and the business problem, all to create multiple models, compare their performance, and then select the best one. Now we can optimize our machine learning model.

Model Optimization

After we create the model, we can still try to optimize it by adjusting the hyperparameters.

We are working on a binary classification problem; that is, we have two possible outputs. For binary classification, Logistic Regression, Linear Discriminant Analysis, and Naive Bayes are, in general, the best options.

On the other hand, we might work with multi-class classification, that is, multiple output classes. Rather than predicting whether or not a person will have diabetes, we could try to predict the stage of the disease (early, intermediate, advanced, or no disease), giving us four or five different classes. In that case, an SVM or a neural network could be a more interesting option.
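Just to illustrate the multi-class case (this uses scikit-learn's built-in iris dataset rather than our diabetes data, purely as an example), the same cross-validation recipe works unchanged, since SVC handles multiple classes out of the box:

# Hedged sketch: the same evaluation recipe on a multi-class problem (iris, 3 classes)
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
X_multi, y_multi = load_iris(return_X_y = True)
kfold = KFold(n_splits = 10, shuffle = True, random_state = 7)
model = SVC()  # SVC applies a one-vs-one scheme internally for multiple classes
result = cross_val_score(model, X_multi, y_multi, cv = kfold)
print("Multi-class SVC Accuracy: %.3f" % (result.mean() * 100))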

Once one of the models has been selected, we can proceed to model optimization or hyperparameter adjustment. All machine learning algorithms are parameterized, which means that we can adjust the predictive model’s performance by tuning the parameters, i.e., fine-tuning the parameters.

Our job is to find the best combination of parameters in each machine learning algorithm. This process is also called Hyperparameter Optimization, and scikit-learn offers two methods for automatic parameter optimization:

  • Grid Search Parameter Tuning
  • Random Search Parameter Tuning.

Each Machine Learning algorithm comes with a set of parameters that we can technically call hyperparameters, which are ways to change the algorithm’s behavior — the problem is that we don’t know the best combination of hyperparameters for each dataset and each business problem.

With that in mind, sklearn developers have created other automatic options to test various combinations of hyperparameters. We can find the variety of hyperparameters that best fits the business and dataset problem we’re working on through this testing process. It is worth testing this procedure after choosing one of the models.

Two main methods for optimizing models

We will use the Logistic Regression algorithm to illustrate the process; in practice, its performance and that of Linear Discriminant Analysis were almost identical.

Because Logistic Regression offers more hyperparameters and is easier to adjust than LDA, let's use it. In the end, the choice rests with the Data Scientist; the tools only support the decision about how the analysis will proceed. At this point we have several models created, and it is up to us to choose which one to keep working on.

1. Grid Search Parameter Tuning

This method methodically tries combinations of the algorithm's parameters, creating a grid, i.e., a table of combinations. We have to indicate which parameter values we want, and grid search then tests every combination of them.

We'll import the GridSearchCV function from model_selection and the LogisticRegression algorithm, then load the data and divide it into X and Y.

Then we create the grid, which is a dictionary of key: value pairs. The 'penalty' key names one hyperparameter and maps to the list of values ['l1', 'l2']; the 'C' key names another hyperparameter, followed by the list of values to test for it.

To know the names of these hyperparameters, we must consult the documentation of the algorithm we are working with, in this case LogisticRegression. Each algorithm has its own specific set of hyperparameters.

[Image: LogisticRegression documentation showing the default hyperparameter values]

The values shown there are the defaults scikit-learn uses for each parameter of the algorithm. The penalty regularizes the model to prevent it from overfitting, while "C" is the inverse of the regularization strength; that is, it is directly linked to the penalty.
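Besides the documentation, we can also list an estimator's hyperparameters and their current (default) values directly in code with get_params(), which every scikit-learn estimator exposes; a quick sketch:

# Hedged sketch: inspecting the default hyperparameters of LogisticRegression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
for name, value in model.get_params().items():
    print(name, '=', value)
# Typical output includes entries such as C = 1.0 and penalty = 'l2'
# (the exact defaults depend on the scikit-learn version installed)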

# Grid Search Parameter Tuning
from pandas import read_csv
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
# Loading Data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting Hyperparameters
grid_values = {'penalty': ['l1', 'l2'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
# Creating model ('liblinear' supports both the l1 and l2 penalties)
model = LogisticRegression(solver = 'liblinear')
# Creating grid
grid = GridSearchCV(estimator = model, param_grid = grid_values)
# Training grid
grid.fit(X, Y)
print("Accuracy: %.3f" % (grid.best_score_ * 100))
print("Best Model Parameters:", grid.best_estimator_)
# Result: 77.08%
# Best Model Parameters: LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=None, penalty='l1', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)

Therefore, we instantiate the LogisticRegression model, call the GridSearchCV function, indicate that we will use the model we have just created and use as param_grid the grid_values set with penalty and C.

After that, we feed GridSearchCV with the X and Y sets, applying fit. We run the cell and return an accuracy of 77%, indicating the best combination of hyperparameters with GridSearchCV optimization.

2. Random Search Parameter Tuning

Another option is the Random Search Parameter Tuning method, which samples the algorithm's parameters from a random distribution for a fixed number of iterations. A model is constructed and tested for each sampled combination of parameters; that is, this method searches the parameter combinations randomly.

Because RandomizedSearchCV evaluates only a sample of the combinations, it is usually faster than GridSearchCV on large grids, although it may miss the single best combination; on a small grid like ours, the two methods take roughly the same time.
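One detail worth noting: besides a fixed list of values, param_distributions also accepts scipy.stats distributions, which is where random search really differs from a grid. A minimal sketch follows; the uniform range below is an illustrative choice, not from the original notebook, and the fit call is left commented since it would reuse the X and Y arrays loaded as in the other examples.

# Hedged sketch: sampling the hyperparameter C from a continuous distribution
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
param_dist = {'C': uniform(loc = 0.001, scale = 1000)}  # samples C in [0.001, 1000.001)
rsearch = RandomizedSearchCV(estimator = LogisticRegression(max_iter = 1000),
                             param_distributions = param_dist,
                             n_iter = 14,
                             random_state = 7)
# rsearch.fit(X, Y) would then try 14 randomly sampled values of C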

# Random Search Parameter Tuning
from pandas import read_csv
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
seed = 7
iterations = 14
grid_values = {'penalty': ['l1', 'l2'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
# Creating model ('liblinear' supports both the l1 and l2 penalties)
model = LogisticRegression(solver = 'liblinear')
# Creating grid for RandomizedSearch
rsearch = RandomizedSearchCV(estimator = model,
                             param_distributions = grid_values,
                             n_iter = iterations,
                             random_state = seed)
# Training Randomized Search
rsearch.fit(X, Y)
print("Accuracy: %.3f" % (rsearch.best_score_ * 100))
print("Best Model Parameters:", rsearch.best_estimator_)
# result 77.08%
# Best Model Parameters: LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn',n_jobs=None, penalty='l1', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)

When we call the RandomizedSearchCV function, we have a few more parameters to specify: the model, grid_values, and the number of iterations (more iterations means more combinations tested, but also more time and computational resources); then we fit it with X and Y.

When running, RandomizedSearchCV reaches 77% accuracy, equal to GridSearchCV, and it finds the same hyperparameters. For a small grid like this one, GridSearchCV is the simpler choice; random search becomes attractive when the grid is too large to test exhaustively.

Save and load the trained model.

So far, we've seen model optimization and hyperparameter tuning. Now we'll see how to save the result of our work, i.e., save the trained model to disk. If we close the notebook, everything we ran would have to run again; depending on the model, it may take days to train it again.

At some point, we will also use this model to make new predictions, presenting new data to it. We'll then need to load the saved model from disk and perform our predictions. Saving the model and loading it back are two important actions within the entire process.

We will use the pickle package, which allows us to save the model in a specific binary format. We save in pickle format and then load in the same format.

# Saving the result
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pickle
# Loading Data
file = 'pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
test_size = 0.33
seed = 7
# Creating train and test datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
# Creating model
model = LogisticRegression()
# Training model
model.fit(X_train, Y_train)
# Saving the model in the documents directory
save_file = 'documents/final_classifier.sav'
pickle.dump(model, open(save_file, 'wb'))
print("Model saved.")
# Loading the saved model with pickle.load
final_classifier = pickle.load(open(save_file, 'rb'))
model_prod = final_classifier.score(X_test, Y_test)
print("Model loaded.")
print("Accuracy: %.3f" % (model_prod * 100))
# Result 75.59%

Therefore, we load the data, divide it into X and Y, split the dataset into training and testing with train_test_split, create the logistic regression model, train the model with a fit, and finally save the model.

We save the model in the documents directory under the name final_classifier, using pickle.dump to serialize the model's contents and write them to the file referenced by the save_file object, opened in 'wb' (write binary) mode. Model saved.

Then, to load the model, we use pickle.load, opening with the open function the same file we saved as save_file.
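To close the loop, here is a short sketch of what using the loaded model on new data could look like; the feature values below are made up purely for illustration and follow the column order preg, plas, pres, skin, test, mass, pedi, age.

# Hedged sketch: loading the saved model and predicting a new, hypothetical patient
import pickle
import numpy as np
final_classifier = pickle.load(open('documents/final_classifier.sav', 'rb'))
# Hypothetical input values, in the same order as the training features
new_patient = np.array([[6, 148, 72, 35, 0, 33.6, 0.627, 50]])
prediction = final_classifier.predict(new_patient)
print("Predicted class:", prediction[0])  # in this dataset, 1.0 = diabetic, 0.0 = non-diabetic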

Optimizing Performance with Ensemble Methods

So far, we have tried several individual algorithms. Now we will combine two or more algorithms to work together as a single model.

The Ensemble method is a set of algorithms that works as if it were a single package, taking several different algorithms and achieving better performance. We have three main categories for these methods:

Bagging: builds multiple models (typically of the same type) from different random subsets of the training dataset and combines their predictions.

Boosting: builds multiple models (typically of the same type), where each model learns to correct the errors made by the previous model in the sequence. This category often performs well: during training, each model produces an error rate, and the algorithm uses those errors to train the next model, so models of the same type keep learning from the mistakes of their predecessors.

Voting: builds multiple models (usually of different types) and uses simple statistics (such as the average) to combine their predictions. It is a voting system; that is, several completely different algorithms work in parallel and, at the end, a vote decides the prediction.

Using the ensemble method is no guarantee that we will achieve greater accuracy.
We should devote ourselves to pre-processing the data regardless of the algorithm we are dealing with.

If we can't reach the accuracy target defined for the project with individual algorithms, we need to experiment with other machine learning models and architectures until we beat the target, or conclude that the desired accuracy cannot be achieved with the data we have.

7. Bagged Decision Tree

It is a method that works well when the model suffers from high variance. We need to import the BaggingClassifier algorithm from the ensemble package and the DecisionTreeClassifier from the tree package.

We load the data, divide it into X and Y, define the parameters, create folds for cross-validation, and create the base machine learning model: a decision tree, via DecisionTreeClassifier.

# Bagged Decision Tree
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# Loading Data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
cart = DecisionTreeClassifier()
num_trees = 100
# (in newer scikit-learn versions the first parameter is named 'estimator')
model = BaggingClassifier(base_estimator = cart,
                          n_estimators = num_trees,
                          random_state = seed)
result = cross_val_score(model, X, Y, cv = kfold)
print("Bagged Decision Tree Accuracy: %.3f" % (result.mean() * 100))
# Result 75.91%

Next, we define how many trees will be created and feed the BaggingClassifier. The DecisionTreeClassifier model is considered a weak classifier, a fairly loose machine learning algorithm on its own. The BaggingClassifier receives the cart estimator, replicates it 100 times through n_estimators, and uses the seed to reproduce the same results.

After that, we applied cross-validation and performed the training, reaching an accuracy of 75.9%. Comparing with the individual algorithms, the BaggingClassifier clearly surpassed SVM and CART and matched Naive Bayes, considering that we had not done any specific pre-processing.

Therefore, the ensemble method already matched or beat several of the six individual algorithms. If we pre-process the data and adjust the hyperparameters, the BaggingClassifier could quite plausibly exceed 80% accuracy.

8. Random Forest Classifier

This algorithm is an excellent choice for selecting variables. Random Forest is an extension of the Bagging Decision Tree.

In practice, Random Forest falls into the Bagging category, where we put several weak classifiers to work together, forming a more robust algorithm.

# Random Forest Classifier
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Loading data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Setting trees
num_trees = 100
max_features = 3
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating Model
model = RandomForestClassifier(n_estimators = num_trees,
                               max_features = max_features)
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
# Print Result
print("Random Forst Accuracy: %.3f" % (result.mean() * 100))
# Result 77.34%

Therefore, we import the RandomForestClassifier algorithm from the ensemble package, load the data, divide X and Y, define the parameters and cross-validation folds, create the RandomForestClassifier model, and finally cross-validate. We get 77% accuracy, surpassing the accuracy of Logistic Regression and practically matching LDA.

9. AdaBoost

AdaBoost is based on the Boosting ensemble approach. The two previous algorithms, RandomForestClassifier and BaggingClassifier, belong to the Bagging category, while AdaBoost belongs to the Boosting category.

Boosting algorithms create a sequence of models that attempt to correct their errors based on previous models within the sequence. Once created, the models make predictions that can receive a weight according to their accuracy, and the results are combined to create a single final prediction.

The Boosting category is one of the most interesting; it uses one model's errors to improve the next model, often providing even better performance: a model that learns from its own mistakes.

AdaBoost assigns weights to the instances in the dataset, defining how easy or difficult they are for the classification process, allowing the algorithm to pay more or less attention to the instances during the model construction process.

# AdaBoost Classifier Model
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
# Loading data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Setting trees
num_trees = 30
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
model = AdaBoostClassifier(n_estimators = num_trees,
                           random_state = seed)
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
# Print Result
print("AdaBoost Classifier Accuracy: %.3f" % (result.mean() * 100))
# Result 75.52%

We have a 75% accuracy, slightly below the previous models, since we need to work better with pre-processing this data.

10. Stochastic Gradient Boosting

Stochastic Gradient Boosting is also one of the most sophisticated ensemble methods, relying on some quite advanced mathematical techniques.

# Stochastic Gradient Boosting
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
# Loading data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Setting number of tress
num_trees = 100
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating model
model = GradientBoostingClassifier(n_estimators = num_trees, random_state = seed)
# Cross Validation
result = cross_val_score(model, X, Y, cv = kfold)
# Print result
print("Gradient Boosting Classifie Accuracy: %.3f" % (result.mean() * 100))
# Result 75.91%

What we do differently is basically change the model. We apply GradientBoostingClassifier, perform cross-validation, and reach the same level as the individual algorithms, 75.9%, without any further data processing.

11. Voting Ensemble

In Bagging and Boosting, all sub-models are of the same type; in our examples, they are all decision trees. Voting, on the other hand, allows us to create multiple models of completely different types. From now on, we will combine completely different algorithms.

This is one of the simplest Ensemble methods. It creates two or more separate models from the training dataset, and the Voting Classifier then uses the average of the predictions of each sub-model to make predictions on new data. The predictions of each sub-model can receive weights, defined manually or through heuristics. There are more advanced versions of voting, where a model learns the best weight to assign to each sub-model; this is called stacking (stacked generalization), which recent scikit-learn versions expose as StackingClassifier.
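As a side note, the weights mentioned above can be passed directly to VotingClassifier. A minimal sketch of weighted soft voting follows, using the same three sub-models as the example below; the weights are chosen arbitrarily for illustration, not tuned for this dataset.

# Hedged sketch: weighted soft voting (weights chosen arbitrarily for illustration)
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
estimators = [('logistic', LogisticRegression(max_iter = 1000)),
              ('cart', DecisionTreeClassifier()),
              ('svm', SVC(probability = True))]  # soft voting needs probability estimates
weighted_ensemble = VotingClassifier(estimators, voting = 'soft', weights = [2, 1, 1])
# weighted_ensemble could then be evaluated with cross_val_score, exactly as below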

# Voting Ensemble
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
# Loading data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
num_folds = 10
seed = 7
# Separating data into folds for Cross-Validation
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
# Creating empty list
estimators = []
# Logistic Regression Model
model1 = LogisticRegression()
estimators.append(('logistic', model1))
# Decision Tree Classifier Model
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
# SVC Model
model3 = SVC()
estimators.append(('svm', model3))
# Creating ensemble model passing the list into VotingClassifier
ensemble = VotingClassifier(estimators)
# Cross Validation
result = cross_val_score(ensemble, X, Y, cv = kfold)
# Print Result
print("Voting Ensemble Accuracy: %.3f" % (result.mean() * 100))
# Result 74.74%

For this case, we need the VotingClassifier from the ensemble package, plus the other algorithms that will take part in the vote. We do the standard import process, split the data into X and Y, and specify the parameters and folds.

This time, however, we create an empty list of estimators and fill it as we go: we create each model, append it to the list, feed this list to the VotingClassifier, cross-validate, and get 74.7% accuracy, a slightly lower performance.

We have to keep in mind that there are now three models to work with. Optimizing a single model is already complex, so in practice this combination requires much more data pre-processing to perform better. The Ensemble method is no guarantee of performance, but it can be worth trying when individual algorithms don't reach the target.

12. XGBoost

This algorithm is something of a secret weapon of the winners of Kaggle's Data Science competitions. It is an extension of the Gradient Boosting (GBM) algorithm that supports multithreading (running the algorithm in parallel) on a single machine and parallel processing across a multi-server cluster.

The main advantage of XGBoost over GBM is its ability to manage sparse data. XGBoost automatically accepts sparse data as input without storing zeros in memory.

1- Accept sparse data (which allows working with sparse matrices) without converting to dense matrices.

2- Builds a learning tree using a modern split method (called quantile sketch), which results in a much shorter processing time than traditional methods.

3- Allows parallel computing on a single machine (through multithreading) and parallel processing on clustered distributed machines.

XGBoost uses essentially the same parameters as GBM and adds advanced handling of missing data. It improves several aspects of Gradient Boosting so that it runs faster without losing accuracy, especially regarding multithreading: the algorithm can run multiple threads on the CPU or even parallelize execution on the GPU, while keeping the main advantage of high accuracy. XGBoost is widely used by Data Scientists who win Kaggle competitions. GitHub repository: https://github.com/dmlc/XGBoost
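A tiny sketch of the sparse-input point, using a randomly generated scipy sparse matrix purely for illustration (not our diabetes data); the shapes and density below are arbitrary choices.

# Hedged sketch: XGBClassifier accepting a scipy sparse matrix directly
import numpy as np
from scipy.sparse import random as sparse_random
from xgboost import XGBClassifier
rng = np.random.RandomState(7)
# Toy CSR matrix: 200 samples, 50 features, 5% non-zero entries
X_sparse = sparse_random(200, 50, density = 0.05, format = 'csr', random_state = rng)
y = rng.randint(0, 2, size = 200)
model = XGBClassifier()
model.fit(X_sparse, y)  # the CSR matrix is consumed as-is; zeros are never materialized
print("Training accuracy on the toy sparse data: %.2f" % model.score(X_sparse, y))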

Install XGBoost from PyPI

Run the installation directly from the operating system's command line (or from the notebook, as below). It requires NumPy and SciPy as prerequisites.

!pip install xgboost

XGBoost Classifier

# XGBoost Classifier
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
# Loading data
file = 'data/pima-data.csv'
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(file, names = columns)
array = data.values
# Separating the array into input and output
X = array[:,0:8]
Y = array[:,8]
# Setting parameters
test_size = 0.33
seed = 7
# Creating train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
# Creating model
model = XGBClassifier()
# Training model
model.fit(X_train, y_train)
# Print model
print(model)
# Making predictions
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# Evaluating predictions
accuracy = accuracy_score(y_test, predictions)
print("XGBoost Accuracy: %.2f%%" % (accuracy * 100.0))
# result 77.95%

This time we use train_test_split instead of cross_val_score because of the higher processing load cross-validation would require. In the end, we have a model created with all its hyperparameters and an accuracy of 77.95%, the highest achieved by any of the algorithms built so far, under the same conditions: no normalization, standardization, or special treatment of the data.

We did not go into detail about the algorithms. The purpose here was to understand the process of building the model, from pre-processing through training, model evaluation, and prediction.

And there we have it. I hope you have found this useful. Thank you for reading.


Building 10 Classifier Models in Machine Learning +Notebook was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

