For my first build project in the Computer Science unit of Lambda School, I chose to implement a Gaussian Naive Bayes classifier.
The completed project repository is hosted on GitHub.

Naive Bayes Classification

A Naive Bayes classifier is an algorithm based on Bayes’ Theorem. Bayes’ Theorem describes the probability of an event occurring based on prior knowledge of conditions related to that event. A Naive Bayes classifier is considered ‘naive’ in that it assumes the features are independent of one another, which greatly reduces computational complexity.
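
In classification terms, Bayes’ Theorem gives the probability of a class given the observed features:

P(class | features) = P(features | class) * P(class) / P(features)

The independence assumption means P(features | class) can be computed as the product of the individual feature likelihoods, one per feature, which is what keeps the calculation tractable.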

Implementation

I built my classifier using Scikit-Learn’s BaseEstimator and ClassifierMixin as base classes for a familiar starting point. I used several sklearn utility functions to handle input validation. I also used some functions from Python’s math module and numpy.

import math

import numpy as np

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils import check_X_y, check_array
from sklearn.utils.validation import check_is_fitted

The basic class definition contains the public methods I am planning to implement: fit and predict.

class NaiveBayes(BaseEstimator, ClassifierMixin):
    def __init__(self):
        pass

    def fit(self, X, y):
        return self
        
    def predict(self, X):
        predictions = []
        return np.array(predictions)

I decided to put all my input validation logic in one function that is called from both the fit and predict methods.

    def _validate_input(self, X, y=None):
        if y is not None:
            # fitting the model, validate X and y
            return check_X_y(X, y)
        else:
            # predicting, validate X
            check_is_fitted(self, ['num_features_', 'feature_summaries_'])
            X = check_array(X)
            if X.shape[1] != self.num_features_:
                raise ValueError(
                    f'unexpected input shape: (x, {X.shape[1]}); '
                    f'must be (x, {self.num_features_})'
                )
            return X

The fit method implements two operations: splitting the training data by its class label, and storing the mean and standard deviation of each feature for each label. I stored this information in a dictionary as a simple way to keep the data pertaining to each class separate. Using the class labels as dictionary keys allows this implementation to accept any hashable type as a class label. This way I do not have to do any converting or tracking of the labels - I can use them as is.

        # create dictionary containing input data separated by class label
        data_by_class = {}
        for i in range(len(X)):
            features = X[i]
            label = y[i]
            if label not in data_by_class:
                # first occurrence of label, create empty list in dictionary
                data_by_class[label] = []
            data_by_class[label].append(features)

After separating the data by class, I store the mean and standard deviation of each feature as an instance variable.

        self.feature_summaries_ = {}
        for label, features in data_by_class.items():
            self.feature_summaries_[label] = [
                (np.mean(column), np.std(column))
                for column in zip(*features)
            ]
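
Putting those two snippets together with the input validation, the complete fit method looks roughly like the condensed sketch below. Storing num_features_ here is my assumption about where the attribute checked in _validate_input gets set; the repository version writes the loops out as shown above.

    def fit(self, X, y):
        X, y = self._validate_input(X, y)
        # remember how many features to expect at prediction time
        self.num_features_ = X.shape[1]

        # split the training data by class label
        data_by_class = {}
        for features, label in zip(X, y):
            data_by_class.setdefault(label, []).append(features)

        # per-class (mean, stdev) of every feature
        self.feature_summaries_ = {
            label: [(np.mean(column), np.std(column)) for column in zip(*features)]
            for label, features in data_by_class.items()
        }
        return self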

Once the training data has been fit, the classifier is ready to make predictions. Predictions are made in two steps for each row of input data. The first step accumulates the log probabilities of each of the features, for each class. I used the sum of the logs of the probabilities instead of the product of the probabilities themselves to avoid the floating point underflow that can occur when multiplying many values close to zero. This is not an issue for small datasets, but can become a problem when calculating over a large number of features.

        for x in X:
            # get cumulative log probabilites for each class for this row
            probabilities = {}
            for label, features in self.feature_summaries_.items():
                probabilities[label] = 0
                for i in range(len(features)):
                    mean, stdev = features[i]
                    probabilities[label] += math.log2(
                        self._liklihood(x[i], mean, stdev)
                    )
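
The _liklihood helper called above evaluates the Gaussian probability density of a single feature value, given that feature's per-class mean and standard deviation. It is not shown in this walkthrough, but a minimal version looks something like this:

    def _liklihood(self, x, mean, stdev):
        # Gaussian probability density of x for a feature with this mean and stdev
        exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
        return exponent / (math.sqrt(2 * math.pi) * stdev)

One caveat: if the density underflows to zero, math.log2 will raise a ValueError, so a more robust version would clamp the return value to a small positive floor.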

Once I have the probabilities for each class, I loop through those results and select the class with the highest probability.

            # find class with highest probability
            best_label, best_prob = None, -1
            for label, probability in probabilities.items():
                if best_label is None or probability > best_prob:
                    best_prob = probability
                    best_label = label

            # prediction for this row
            predictions.append(best_label)

Although this classifier works just as well as sklearn’s GaussianNB on a small, clean dataset (I tested on sklearn’s wine dataset), it is not ready for use in a production environment. Dirty data, large datasets, and who knows what else will break this implementation.
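
For anyone who wants to reproduce the comparison, the test looked roughly like this (a sketch rather than the exact script from the repository; the random_state is arbitrary):

from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mine = NaiveBayes().fit(X_train, y_train)
sklearn_nb = GaussianNB().fit(X_train, y_train)

print('NaiveBayes:', accuracy_score(y_test, mine.predict(X_test)))
print('GaussianNB:', accuracy_score(y_test, sklearn_nb.predict(X_test)))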

Results vs sklearn's GaussianNB

Resources

Naive Bayes Classification

Scikit-Learn