In [2]:

```
from __future__ import division
import graphlab
```

In [4]:

```
products = graphlab.SFrame('amazon_baby_subset.gl/')
```

In [6]:

```
import json
with open('important_words.json', 'r') as f: # Reads the list of most frequent words
    important_words = json.load(f)
important_words = [str(s) for s in important_words]

def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation)

# Remove punctuation.
products['review_clean'] = products['review'].apply(remove_punctuation)

# Split out the words into individual columns
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
```
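
As a side note, `split().count(word)` counts whole-word occurrences only, which is why punctuation is stripped first. A made-up illustration (not part of the assignment):

```
example = 'great book great price'
print example.split().count('great')     # 2: counts whole-word matches
print 'greatest'.split().count('great')  # 0: substrings do not match
```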

In [7]:

```
train_data, validation_data = products.random_split(.8, seed=2)
```

In [8]:

```
import numpy as np
def get_numpy_data(data_sframe, features, label):
data_sframe['intercept'] = 1
features = ['intercept'] + features
features_sframe = data_sframe[features]
feature_matrix = features_sframe.to_numpy()
label_sarray = data_sframe[label]
label_array = label_sarray.to_numpy()
return(feature_matrix, label_array)
```

We convert both the training and validation sets into NumPy arrays.

In [9]:

```
feature_matrix_train, sentiment_train = get_numpy_data(train_data, important_words, 'sentiment')
feature_matrix_valid, sentiment_valid = get_numpy_data(validation_data, important_words, 'sentiment')
```

The link function for logistic regression is defined as follows, where the feature vector $h(\mathbf{x}_i)$ is given by the word counts of **important_words** in the review $\mathbf{x}_i$:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))},
$$

In [10]:

```
def predict_probability(feature_matrix, coefficients):
    scores = np.dot(feature_matrix, coefficients)
    probabilities = 1. / (1. + np.exp(-scores))
    return probabilities
```
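
As a quick sanity check (a toy example of my own, not part of the assignment): with all-zero coefficients every score is 0, so every predicted probability should be exactly 0.5.

```
dummy_feature_matrix = np.array([[1., 2., 3.], [1., -1., 0.]])
dummy_coefficients = np.zeros(3)
print predict_probability(dummy_feature_matrix, dummy_coefficients)  # [ 0.5  0.5]
```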

To calculate the per-coefficient derivative of the log likelihood, only a small modification is needed to add an L2 penalty, compared to the unregularized version. The term shown in **red** below is the one added by the **L2 penalty**. In code, the derivative is `sum(errors * feature) - 2 * l2_penalty * coefficient`, or more formally:
$$
\frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) \color{red}{-2\lambda w_j }
$$
As we did in the Regression course, we do not apply the L2 penalty to the intercept. A large intercept does not necessarily indicate overfitting, because the intercept is not associated with any particular feature. So for the intercept term, we have
$$
\frac{\partial\ell}{\partial w_0} = \sum_{i=1}^N h_0(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)
$$

In [11]:

```
def feature_derivative_with_L2(errors, feature, coefficient, l2_penalty, feature_is_constant):
    derivative = np.dot(errors, feature)
    if not feature_is_constant:              # do not penalize the intercept
        derivative -= 2 * l2_penalty * coefficient
    return derivative
```
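
To see the effect of the penalty term, here is a small hand-checked example with made-up values: with errors = [0.5, -0.25] and feature = [1, 2], the unpenalized derivative is 0.5*1 + (-0.25)*2 = 0; with coefficient 1.0 and l2_penalty 4, the penalized derivative is 0 - 2*4*1.0 = -8.

```
errors = np.array([0.5, -0.25])
feature = np.array([1., 2.])
print feature_derivative_with_L2(errors, feature, coefficient=1.0,
                                 l2_penalty=4, feature_is_constant=True)   # 0.0 (intercept: no penalty)
print feature_derivative_with_L2(errors, feature, coefficient=1.0,
                                 l2_penalty=4, feature_is_constant=False)  # -8.0
```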

This function computes the regularized log likelihood; we track the log likelihood rather than the likelihood for numerical stability. As with the derivative, the penalty term excludes the intercept $w_0$. $$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) \color{red}{-\lambda\|\mathbf{w}\|_2^2} $$

In [12]:

```
def compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty):
    indicator = (sentiment==+1)
    scores = np.dot(feature_matrix, coefficients)
    lp = np.sum((indicator-1)*scores - np.log(1. + np.exp(-scores))) - l2_penalty*np.sum(coefficients[1:]**2)
    return lp
```
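
A minimal numerical check (toy values of my own) that the intercept is indeed excluded from the penalty: the penalized and unpenalized log likelihoods should differ by exactly $\lambda \sum_{j \geq 1} w_j^2$.

```
toy_features = np.array([[1., 2.], [1., -1.]])   # first column is the intercept
toy_sentiment = np.array([+1, -1])
toy_coefficients = np.array([3., 0.5])
ll_plain = compute_log_likelihood_with_L2(toy_features, toy_sentiment, toy_coefficients, l2_penalty=0)
ll_l2 = compute_log_likelihood_with_L2(toy_features, toy_sentiment, toy_coefficients, l2_penalty=10)
print ll_plain - ll_l2   # 10 * 0.5**2 = 2.5; the intercept (3.0) is not penalized
```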

In [13]:

```
def logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, l2_penalty, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in xrange(max_iter):
        # Predict P(y_i = +1 | x_i, w) using the current coefficients
        predictions = predict_probability(feature_matrix, coefficients)
        indicator = (sentiment==+1)
        errors = indicator - predictions
        for j in xrange(len(coefficients)): # loop over each coefficient
            is_intercept = (j == 0)
            derivative = feature_derivative_with_L2(errors, feature_matrix[:,j], coefficients[j], l2_penalty, is_intercept)
            coefficients[j] += step_size * derivative   # gradient ascent step
        # Checking whether log likelihood is increasing
        if itr == (max_iter-1):
            lp = compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty)
            print 'iteration %*d: log likelihood of observed labels = %.8f with a penalty of %s' % (int(np.ceil(np.log10(max_iter))), itr, lp, l2_penalty)
    return coefficients
```

Now that we have written all the pieces needed for regularized logistic regression, let's explore the benefits of **L2 regularization** in analyzing sentiment for product reviews. **As the iterations proceed, the log likelihood should increase, since gradient ascent climbs the hill.**

Below, we train models with increasing amounts of regularization, starting with no L2 penalty.

In [14]:

```
coefficients_0_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train, initial_coefficients=np.zeros(194), step_size=5e-6, l2_penalty=0, max_iter=501)
coefficients_4_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train, initial_coefficients=np.zeros(194), step_size=5e-6, l2_penalty=4, max_iter=501)
coefficients_10_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train, initial_coefficients=np.zeros(194), step_size=5e-6, l2_penalty=10, max_iter=501)
coefficients_1e2_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train, initial_coefficients=np.zeros(194), step_size=5e-6, l2_penalty=1e2, max_iter=501)
coefficients_1e3_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train, initial_coefficients=np.zeros(194), step_size=5e-6, l2_penalty=1e3, max_iter=501)
coefficients_1e5_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train, initial_coefficients=np.zeros(194), step_size=5e-6, l2_penalty=1e5, max_iter=501)
```

The higher the penalty, the lower the final log likelihood: stronger regularization pulls the coefficients toward zero, increasing the training error.
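
We can verify this directly by evaluating the unregularized log likelihood (`l2_penalty=0`) of each learned coefficient vector on the training data; the values should decrease as the penalty grows. A sketch using only the functions defined above:

```
for name, coefficients in [('L2=0', coefficients_0_penalty), ('L2=4', coefficients_4_penalty),
                           ('L2=10', coefficients_10_penalty), ('L2=1e2', coefficients_1e2_penalty),
                           ('L2=1e3', coefficients_1e3_penalty), ('L2=1e5', coefficients_1e5_penalty)]:
    print name, compute_log_likelihood_with_L2(feature_matrix_train, sentiment_train, coefficients, l2_penalty=0)
```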

We now compare the **coefficients** for each of the models that were trained above. We will create a table of features and learned coefficients associated with each of the different L2 penalty values. Below is a simple helper function that will help us create this table.

In [15]:

```
table = graphlab.SFrame({'word': ['(intercept)'] + important_words})
def add_coefficients_to_table(coefficients, column_name):
    table[column_name] = coefficients
    return table
```

Now, let's run the function `add_coefficients_to_table` for each of the L2 penalty strengths.

In [16]:

```
add_coefficients_to_table(coefficients_0_penalty, '[L2=0]')
add_coefficients_to_table(coefficients_4_penalty, '[L2=4]')
add_coefficients_to_table(coefficients_10_penalty, '[L2=10]')
add_coefficients_to_table(coefficients_1e2_penalty, '[L2=1e2]')
add_coefficients_to_table(coefficients_1e3_penalty, '[L2=1e3]')
add_coefficients_to_table(coefficients_1e5_penalty, '[L2=1e5]')[1:4]
```


In [17]:

```
positive_words = table.sort(['[L2=10]'], ascending=False).head(5)['word']
negative_words = table.sort(['[L2=10]'], ascending=True).head(5)['word']
print positive_words
print negative_words
```

In [18]:

```
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 6
def make_coefficient_plot(table, positive_words, negative_words, l2_penalty_list):
    cmap_positive = plt.get_cmap('Reds')
    cmap_negative = plt.get_cmap('Blues')

    xx = l2_penalty_list
    plt.plot(xx, [0.]*len(xx), '--', lw=1, color='k')

    table_positive_words = table.filter_by(column_name='word', values=positive_words)
    table_negative_words = table.filter_by(column_name='word', values=negative_words)
    del table_positive_words['word']
    del table_negative_words['word']

    for i in xrange(len(positive_words)):
        color = cmap_positive(0.8*((i+1)/(len(positive_words)*1.2)+0.15))
        plt.plot(xx, table_positive_words[i:i+1].to_numpy().flatten(), '-', label=positive_words[i], linewidth=4.0, color=color)

    for i in xrange(len(negative_words)):
        color = cmap_negative(0.8*((i+1)/(len(negative_words)*1.2)+0.15))
        plt.plot(xx, table_negative_words[i:i+1].to_numpy().flatten(), '-', label=negative_words[i], linewidth=4.0, color=color)

    plt.legend(loc='best', ncol=3, prop={'size':16}, columnspacing=0.5)
    plt.axis([1, 1e5, -1, 2])
    plt.title('Coefficient path')
    plt.xlabel('L2 penalty ($\lambda$)')
    plt.ylabel('Coefficient value')
    plt.xscale('log')
    plt.rcParams.update({'font.size': 18})
    plt.tight_layout()
```

In [19]:

```
make_coefficient_plot(table, positive_words, negative_words, l2_penalty_list=[0, 4, 10, 1e2, 1e3, 1e5])
```

Now, let us compute the accuracy of the classifier model.

Recall from lecture that the class prediction is calculated using $$ \hat{y}_i = \left\{ \begin{array}{ll} +1 & h(\mathbf{x}_i)^T\mathbf{w} > 0 \\ -1 & h(\mathbf{x}_i)^T\mathbf{w} \leq 0 \\ \end{array} \right. $$

In [20]:

```
def get_classification_accuracy(feature_matrix, sentiment, coefficients):
    scores = np.dot(feature_matrix, coefficients)
    apply_threshold = np.vectorize(lambda x: 1. if x > 0 else -1.)
    predictions = apply_threshold(scores)
    num_correct = (predictions == sentiment).sum()
    accuracy = num_correct / len(feature_matrix)
    return accuracy
```
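
Note that `np.vectorize` is just a convenience here; it loops in Python under the hood. An equivalent, typically faster alternative (a sketch, not the assignment's required implementation) thresholds the whole score array at once:

```
def get_classification_accuracy_fast(feature_matrix, sentiment, coefficients):
    scores = np.dot(feature_matrix, coefficients)
    predictions = np.where(scores > 0, +1, -1)   # same threshold rule, no Python-level loop
    return (predictions == sentiment).sum() / len(feature_matrix)
```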

We now measure classification accuracy on the **training data** and **validation data** for all the models trained in this assignment. We first calculate the accuracy values, then build a simple report summarizing the performance of the various models.

In [21]:

```
train_accuracy = {}
train_accuracy[0] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_0_penalty)
train_accuracy[4] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_4_penalty)
train_accuracy[10] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_10_penalty)
train_accuracy[1e2] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_1e2_penalty)
train_accuracy[1e3] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_1e3_penalty)
train_accuracy[1e5] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_1e5_penalty)
validation_accuracy = {}
validation_accuracy[0] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_0_penalty)
validation_accuracy[4] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_4_penalty)
validation_accuracy[10] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_10_penalty)
validation_accuracy[1e2] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_1e2_penalty)
validation_accuracy[1e3] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_1e3_penalty)
validation_accuracy[1e5] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_1e5_penalty)
```
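
The repetitive cell above could equally be written as a loop over (penalty, coefficients) pairs; a compact equivalent, assuming the same variable names as above:

```
models = [(0, coefficients_0_penalty), (4, coefficients_4_penalty), (10, coefficients_10_penalty),
          (1e2, coefficients_1e2_penalty), (1e3, coefficients_1e3_penalty), (1e5, coefficients_1e5_penalty)]
train_accuracy = {}
validation_accuracy = {}
for l2_penalty, coefficients in models:
    train_accuracy[l2_penalty] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients)
    validation_accuracy[l2_penalty] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients)
```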

In [22]:

```
for key in sorted(validation_accuracy.keys()):
    print "L2 penalty = %6g" % key, "--- train accuracy = %.7s, validation_accuracy = %.7s" % (train_accuracy[key], validation_accuracy[key])
```