Regularized Logistic Regression from Scratch

In [2]:
from __future__ import division
import graphlab
In [4]:
products = graphlab.SFrame('amazon_baby_subset.gl/')

Data Manipulation

Remove Punctuation, set word counts

We will work with a hand-curated list of important words extracted from the review data, and perform two simple data transformations: (1) remove punctuation and (2) compute word counts (for the important_words only).

In [6]:
import json
with open('important_words.json', 'r') as f: # Reads the list of most frequent words
    important_words = json.load(f)
important_words = [str(s) for s in important_words]

def remove_punctuation(text):
    import string
    # Python 2 str.translate: delete every punctuation character from the review text.
    return text.translate(None, string.punctuation)

# Remove punctuation.
products['review_clean'] = products['review'].apply(remove_punctuation)

# Split out the words into individual columns
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
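Note that text.translate(None, string.punctuation) only works on Python 2 byte strings, which is what this notebook assumes. If you happen to run the cell under Python 3, a minimal equivalent (a sketch, not part of the original assignment) would be:

import string

def remove_punctuation_py3(text):
    # str.maketrans('', '', string.punctuation) builds a table that deletes
    # every punctuation character when passed to str.translate.
    return text.translate(str.maketrans('', '', string.punctuation))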

Train & Validation split, make numpy

In [7]:
train_data, validation_data = products.random_split(.8, seed=2)

This function extracts columns from an SFrame and converts them into a NumPy array. Two arrays are returned: one representing features and another representing class labels.

In [8]:
import numpy as np
def get_numpy_data(data_sframe, features, label):
    data_sframe['intercept'] = 1
    features = ['intercept'] + features
    features_sframe = data_sframe[features]
    feature_matrix = features_sframe.to_numpy()
    label_sarray = data_sframe[label]
    label_array = label_sarray.to_numpy()
    return (feature_matrix, label_array)

We convert both the training and validation sets into NumPy arrays.

In [9]:
feature_matrix_train, sentiment_train = get_numpy_data(train_data, important_words, 'sentiment')
feature_matrix_valid, sentiment_valid = get_numpy_data(validation_data, important_words, 'sentiment') 

Regularized Logistic Regression

Calculating Probabilities

The link function for logistic regression can be defined as following, where the feature vector $h(\mathbf{x}_i)$ is given by the word counts of important_words in the review $\mathbf{x}_i$. $$ P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}, $$

In [10]:
def predict_probability(feature_matrix, coefficients):
    scores = np.dot(feature_matrix, coefficients)
    probabilities = 1. / (1. + np.exp(-scores))
    return probabilities
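As a quick sanity check, predict_probability can be exercised on a tiny made-up feature matrix (a sketch with dummy numbers, not part of the assignment data):

# Toy example: 2 data points, 3 features (the first column is the intercept).
dummy_feature_matrix = np.array([[1., 2., 3.], [1., -1., -1.]])
dummy_coefficients = np.array([1., 3., -1.])
# Scores are 1*1 + 2*3 + 3*(-1) = 4 and 1*1 + (-1)*3 + (-1)*(-1) = -1,
# so the probabilities should be roughly 0.982 and 0.269.
print predict_probability(dummy_feature_matrix, dummy_coefficients)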

Adding L2 penalty

To calculate the per-coefficient derivative of the log likelihood, only a small modification is needed to add an L2 penalty compared to unregularized regression. The term indicated in red is the one added by the L2 penalty: in code, the derivative is the sum of features times errors, minus $2\lambda$ times the coefficient. $$ \frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) \color{red}{-2\lambda w_j } $$ As we did in the Regression course, we do not apply the L2 penalty to the intercept. A large intercept does not necessarily indicate overfitting, because the intercept is not associated with any particular feature. So for the intercept term we have $$ \frac{\partial\ell}{\partial w_0} = \sum_{i=1}^N h_0(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) $$

In [11]:
def feature_derivative_with_L2(errors, feature, coefficient, l2_penalty, feature_is_constant): 
    derivative = np.dot(errors, feature)
    if not feature_is_constant: 
        derivative -= 2 * l2_penalty * coefficient
    return derivative

Log Likelihood

This function computes the log likelihood; we track the log likelihood rather than the likelihood itself for numerical stability. The red term is the L2 penalty; matching the derivative above, the code below excludes the intercept from the penalty. $$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) \color{red}{-\lambda\|\mathbf{w}\|_2^2} $$

In [12]:
def compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty):
    indicator = (sentiment==+1)
    scores = np.dot(feature_matrix, coefficients)
    lp = np.sum((indicator-1)*scores - np.log(1. + np.exp(-scores))) - l2_penalty*np.sum(coefficients[1:]**2)
    return lp
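Before running full gradient ascent, it is worth checking that feature_derivative_with_L2 and compute_log_likelihood_with_L2 agree with each other. The helper below is hypothetical (not part of the assignment): it compares the analytic derivative for one coefficient against a centered finite-difference approximation of the log likelihood, and the two values should agree closely.

def gradient_check(feature_matrix, sentiment, coefficients, l2_penalty, j, epsilon=1e-6):
    # Analytic derivative for coefficient j (the intercept, j == 0, is not penalized).
    errors = (sentiment == +1) - predict_probability(feature_matrix, coefficients)
    analytic = feature_derivative_with_L2(errors, feature_matrix[:, j],
                                          coefficients[j], l2_penalty, j == 0)
    # Centered finite difference of the penalized log likelihood.
    coeffs_plus  = np.array(coefficients, dtype=float); coeffs_plus[j]  += epsilon
    coeffs_minus = np.array(coefficients, dtype=float); coeffs_minus[j] -= epsilon
    numeric = (compute_log_likelihood_with_L2(feature_matrix, sentiment, coeffs_plus,  l2_penalty) -
               compute_log_likelihood_with_L2(feature_matrix, sentiment, coeffs_minus, l2_penalty)) / (2 * epsilon)
    return analytic, numeric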

Putting it together

In [13]:
def logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, l2_penalty, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in xrange(max_iter):
        predictions = predict_probability(feature_matrix, coefficients)
        indicator = (sentiment==+1)
        errors = indicator - predictions
        for j in xrange(len(coefficients)): # loop over each coefficient
            is_intercept = (j == 0)
            derivative = feature_derivative_with_L2(errors, feature_matrix[:,j], coefficients[j], l2_penalty, is_intercept)
            coefficients[j] += step_size * derivative
            
        # Report the log likelihood at the final iteration
        if itr == (max_iter-1):
            lp = compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty)
            print 'iteration %*d: log likelihood of observed labels = %.8f with a penalty of %s' % (int(np.ceil(np.log10(max_iter))), itr, lp, l2_penalty)
    return coefficients

Explore effects of L2 regularization

Now that we have written all the pieces needed for regularized logistic regression, let's explore the benefits of L2 regularization when analyzing sentiment in product reviews. As iterations pass, the log likelihood should increase as gradient ascent climbs toward the maximum.
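If you want to verify that increase directly, a small variation of the training loop can record the log likelihood at every iteration (a sketch; log_likelihood_path is a hypothetical helper, not part of the assignment):

def log_likelihood_path(feature_matrix, sentiment, initial_coefficients, step_size, l2_penalty, max_iter):
    # Same gradient-ascent updates as logistic_regression_with_L2, but keep
    # the log likelihood after each pass so its increase can be inspected.
    coefficients = np.array(initial_coefficients)
    path = []
    for itr in xrange(max_iter):
        errors = (sentiment == +1) - predict_probability(feature_matrix, coefficients)
        for j in xrange(len(coefficients)):
            derivative = feature_derivative_with_L2(errors, feature_matrix[:, j],
                                                    coefficients[j], l2_penalty, j == 0)
            coefficients[j] += step_size * derivative
        path.append(compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty))
    return coefficients, path

For a sufficiently small step size, the recorded values should be (nearly) non-decreasing.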

Train models with increasing L2 Penalty

Below, we train models with increasing amounts of regularization, starting with no L2 penalty.

In [14]:
coefficients_0_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train, initial_coefficients=np.zeros(194), step_size=5e-6, l2_penalty=0, max_iter=501)
coefficients_4_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train, initial_coefficients=np.zeros(194), step_size=5e-6, l2_penalty=4, max_iter=501)
coefficients_10_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train, initial_coefficients=np.zeros(194), step_size=5e-6, l2_penalty=10, max_iter=501)
coefficients_1e2_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train, initial_coefficients=np.zeros(194), step_size=5e-6, l2_penalty=1e2, max_iter=501)
coefficients_1e3_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train, initial_coefficients=np.zeros(194), step_size=5e-6, l2_penalty=1e3, max_iter=501)
coefficients_1e5_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train, initial_coefficients=np.zeros(194), step_size=5e-6, l2_penalty=1e5, max_iter=501)
iteration 500: log likelihood of observed labels = -19876.62333410 with a penalty of 0
iteration 500: log likelihood of observed labels = -19956.11341777 with a penalty of 4
iteration 500: log likelihood of observed labels = -20072.16321721 with a penalty of 10
iteration 500: log likelihood of observed labels = -21451.95551390 with a penalty of 100.0
iteration 500: log likelihood of observed labels = -25532.33970049 with a penalty of 1000.0
iteration 500: log likelihood of observed labels = -29271.17666862 with a penalty of 100000.0

The higher the penalty, the lower the final log likelihood on the training data: stronger regularization pulls the coefficients away from the maximum-likelihood fit and increases training error.

Compare coefficients

We now compare the coefficients for each of the models that were trained above. We will create a table of features and learned coefficients associated with each of the different L2 penalty values. Below is a simple helper function that will help us create this table.

In [15]:
table = graphlab.SFrame({'word': ['(intercept)'] + important_words})
def add_coefficients_to_table(coefficients, column_name):
    table[column_name] = coefficients
    return table

Now, let's run the function add_coefficients_to_table for each of the L2 penalty strengths.

In [16]:
add_coefficients_to_table(coefficients_0_penalty, '[L2=0]')
add_coefficients_to_table(coefficients_4_penalty, '[L2=4]')
add_coefficients_to_table(coefficients_10_penalty, '[L2=10]')
add_coefficients_to_table(coefficients_1e2_penalty, '[L2=1e2]')
add_coefficients_to_table(coefficients_1e3_penalty, '[L2=1e3]')
add_coefficients_to_table(coefficients_1e5_penalty, '[L2=1e5]')[1:4]
Out[16]:
word   [L2=0]           [L2=4]           [L2=10]          [L2=1e2]         [L2=1e3]           [L2=1e5]
baby   0.0740730059216  0.0739938541405  0.0738773534804  0.0723603602218  0.0597516888364    0.0017841492163
one    0.0127525057784  0.0124949704481  0.0121152529534  0.0072472833187  -0.00876091762004  -0.00182685568023
great  0.801624989778   0.796896933003   0.789935147221   0.701425073675   0.376011714222     0.00894956049736
[3 rows x 7 columns]
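Notice in the table above that the coefficients shrink toward zero as the L2 penalty grows. As a quick check using the coefficient vectors already computed above (a sketch; no new data involved), the L2 norm of the non-intercept coefficients should fall accordingly:

# L2 norm of the learned coefficients (excluding the intercept) for each penalty.
for name, coefs in [('L2=0', coefficients_0_penalty), ('L2=4', coefficients_4_penalty),
                    ('L2=10', coefficients_10_penalty), ('L2=1e2', coefficients_1e2_penalty),
                    ('L2=1e3', coefficients_1e3_penalty), ('L2=1e5', coefficients_1e5_penalty)]:
    print name, np.linalg.norm(coefs[1:])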

Using the coefficients trained with L2 penalty 10, find the 5 most positive words (with largest coefficients). Similarly, find the 5 most negative words.

In [17]:
positive_words = table.sort(['[L2=10]'], ascending=False).head(5)['word']
negative_words = table.sort(['[L2=10]'], ascending=True).head(5)['word']
print positive_words
print negative_words
['love', 'loves', 'easy', 'perfect', 'great']
['disappointed', 'money', 'return', 'waste', 'returned']

Let us observe the effect of increasing L2 penalty on the 10 words just selected. We provide a utility function to plot the coefficient path.

In [18]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 6

def make_coefficient_plot(table, positive_words, negative_words, l2_penalty_list):
    cmap_positive = plt.get_cmap('Reds')
    cmap_negative = plt.get_cmap('Blues')
    
    xx = l2_penalty_list
    plt.plot(xx, [0.]*len(xx), '--', lw=1, color='k')
    
    table_positive_words = table.filter_by(column_name='word', values=positive_words)
    table_negative_words = table.filter_by(column_name='word', values=negative_words)
    del table_positive_words['word']
    del table_negative_words['word']
    
    for i in xrange(len(positive_words)):
        color = cmap_positive(0.8*((i+1)/(len(positive_words)*1.2)+0.15))
        plt.plot(xx, table_positive_words[i:i+1].to_numpy().flatten(), '-', label=positive_words[i], linewidth=4.0, color=color)
        
    for i in xrange(len(negative_words)):
        color = cmap_negative(0.8*((i+1)/(len(negative_words)*1.2)+0.15))
        plt.plot(xx, table_negative_words[i:i+1].to_numpy().flatten(), '-', label=negative_words[i], linewidth=4.0, color=color)
        
    plt.legend(loc='best', ncol=3, prop={'size':16}, columnspacing=0.5)
    plt.axis([1, 1e5, -1, 2])
    plt.title('Coefficient path')
    plt.xlabel('L2 penalty ($\lambda$)')
    plt.ylabel('Coefficient value')
    plt.xscale('log')
    plt.rcParams.update({'font.size': 18})
    plt.tight_layout()
In [19]:
make_coefficient_plot(table, positive_words, negative_words, l2_penalty_list=[0, 4, 10, 1e2, 1e3, 1e5])

Compare accuracy

Now, let us compute the accuracy of the classifier model.

Recall from lecture that the class prediction is calculated using $$ \hat{y}_i = \left\{ \begin{array}{ll} +1 & h(\mathbf{x}_i)^T\mathbf{w} > 0 \\ -1 & h(\mathbf{x}_i)^T\mathbf{w} \leq 0 \\ \end{array} \right. $$

In [20]:
def get_classification_accuracy(feature_matrix, sentiment, coefficients):
    scores = np.dot(feature_matrix, coefficients)
    apply_threshold = np.vectorize(lambda x: 1. if x > 0  else -1.)
    predictions = apply_threshold(scores)
    
    num_correct = (predictions == sentiment).sum()
    accuracy = num_correct / len(feature_matrix)    
    return accuracy

Below, we compare the accuracy on the training data and validation data for all the models that were trained in this assignment. We first calculate the accuracy values and then build a simple report summarizing the performance for the various models.

In [21]:
train_accuracy = {}
train_accuracy[0]   = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_0_penalty)
train_accuracy[4]   = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_4_penalty)
train_accuracy[10]  = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_10_penalty)
train_accuracy[1e2] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_1e2_penalty)
train_accuracy[1e3] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_1e3_penalty)
train_accuracy[1e5] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_1e5_penalty)

validation_accuracy = {}
validation_accuracy[0]   = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_0_penalty)
validation_accuracy[4]   = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_4_penalty)
validation_accuracy[10]  = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_10_penalty)
validation_accuracy[1e2] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_1e2_penalty)
validation_accuracy[1e3] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_1e3_penalty)
validation_accuracy[1e5] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_1e5_penalty)
In [22]:
for key in sorted(validation_accuracy.keys()):
    print "L2 penalty = %6g" % key, "--- train accuracy = %.7s, validation_accuracy = %.7s" % (train_accuracy[key], validation_accuracy[key])
L2 penalty =      0 --- train accuracy = 0.78515, validation_accuracy = 0.78143
L2 penalty =      4 --- train accuracy = 0.78510, validation_accuracy = 0.78153
L2 penalty =     10 --- train accuracy = 0.78499, validation_accuracy = 0.78171
L2 penalty =    100 --- train accuracy = 0.78397, validation_accuracy = 0.78106
L2 penalty =   1000 --- train accuracy = 0.77585, validation_accuracy = 0.77135
L2 penalty = 100000 --- train accuracy = 0.68036, validation_accuracy = 0.66781

As expected, training accuracy is highest for the model with no L2 penalty. However, that model is slightly overfit: validation accuracy peaks at L2 penalty = 10.
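To pick the winning penalty programmatically, a one-liner over the validation_accuracy dictionary built above suffices (a minimal sketch):

# L2 penalty value with the highest validation accuracy.
best_l2 = max(validation_accuracy, key=validation_accuracy.get)
print 'Best L2 penalty: %g (validation accuracy = %.5f)' % (best_l2, validation_accuracy[best_l2])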