# Precision and Recall¶

The goal of this assignment is to understand precision-recall in the context of classifiers.

In [18]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

In [2]:
from __future__ import division
import string
import numpy as np
import graphlab
graphlab.canvas.set_target('ipynb')

In [4]:
products = graphlab.SFrame('amazon_baby.gl/')

# Preparations¶

## Extract word counts and sentiments¶

We compute the word counts for individual words and extract positive and negative sentiments from the ratings.

In [5]:
def remove_punctuation(text):
    return text.translate(None, string.punctuation)

# Remove punctuation, then count words
review_clean = products['review'].apply(remove_punctuation)
products['word_count'] = graphlab.text_analytics.count_words(review_clean)

# Drop neutral-sentiment reviews, assign +1/-1 to the others
products = products[products['rating'] != 3]
products['sentiment'] = products['rating'].apply(lambda rating: +1 if rating > 3 else -1)

In [6]:
products.head(1)

Out[6]:

| name | review | rating | word_count | sentiment |
| --- | --- | --- | --- | --- |
| Planetwise Wipe Pouch | it came early and was not disappointed. i love ... | 5.0 | {'and': 3L, 'love': 1L, 'it': 3L, 'highly': 1L, ... | 1 |

[1 rows x 5 columns]

## Training and test split¶

In [7]:
train_data, test_data = products.random_split(.8, seed=1)

## Train a logistic classifier¶

In [8]:
model = graphlab.logistic_classifier.create(train_data,
                                            target='sentiment',
                                            features=['word_count'],
                                            validation_set=None,
                                            verbose=False)

# Model Evaluation¶

## Accuracy¶

One performance metric we will use for our more advanced exploration is accuracy. Recall that accuracy is given by

$$\mbox{accuracy} = \frac{\mbox{# correctly classified data points}}{\mbox{# total data points}}$$

In [9]:
accuracy = model.evaluate(test_data, metric='accuracy')['accuracy']
print "Test Accuracy: %s" % accuracy

Test Accuracy: 0.914536837053

## Majority class prediction¶

The majority class classifier is a baseline (i.e., reference) model used as a point of comparison for a more sophisticated classifier. It predicts the majority class for all data points, and a good model should beat it. Since the majority class in this dataset is the positive class (i.e., there are more positive than negative reviews), the accuracy of the majority class classifier can be computed as follows:

In [10]:
baseline = len(test_data[test_data['sentiment'] == 1]) / len(test_data)
print "Baseline accuracy (majority class classifier): %s" % baseline

Baseline accuracy (majority class classifier): 0.842782577394

## Confusion Matrix¶

Accuracy, while convenient, does not tell the whole story. For a fuller picture, we turn to the confusion matrix. In the case of binary classification, the confusion matrix is a 2-by-2 matrix laying out the correct and incorrect predictions for each label:

In [11]:
confusion_matrix = model.evaluate(test_data, metric='confusion_matrix')['confusion_matrix']
confusion_matrix

Out[11]:

| target_label | predicted_label | count |
| --- | --- | --- |
| -1 | -1 | 3798 |
| -1 | 1 | 1443 |
| 1 | -1 | 1406 |
| 1 | 1 | 26689 |

[4 rows x 3 columns]

## Computing the cost of mistakes¶

Suppose you know the costs involved in each kind of mistake: \$100 for each false positive and \$1 for each false negative. What is the cost of the model's mistakes on the test data?
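As a sanity check, both the test accuracy and the majority-class baseline reported above can be recomputed from the confusion-matrix counts alone; a minimal sketch in plain Python (the variable names are mine, the counts come from the confusion matrix above):

```python
# Confusion-matrix counts on the test data
tn = 3798    # predicted -1, actually -1
fp = 1443    # predicted +1, actually -1
fn = 1406    # predicted -1, actually +1
tp = 26689   # predicted +1, actually +1
total = tn + fp + fn + tp

accuracy = (tp + tn) / float(total)   # correctly classified / all points
baseline = (tp + fn) / float(total)   # fraction of (ground-truth) positives

print("accuracy: %.6f" % accuracy)   # 0.914537
print("baseline: %.6f" % baseline)   # 0.842783
```

Both values agree with the `model.evaluate` output and the baseline computation above.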

In [12]:
1443*100+1406

Out[12]:
145706
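The same figure, written with named counts so the arithmetic documents itself; a minimal sketch in plain Python (variable names are mine):

```python
# Counts from the confusion matrix on the test data
false_positives = 1443   # predicted +1, actually -1
false_negatives = 1406   # predicted -1, actually +1

# Unit costs stated above: $100 per false positive, $1 per false negative
cost = 100 * false_positives + 1 * false_negatives
print(cost)  # 145706
```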

## Precision and Recall¶

You may not have exact dollar amounts for each kind of mistake. Instead, you may simply prefer to reduce the percentage of false positives to be less than, say, 3.5% of all positive predictions. This is where precision comes in:

$$[\text{precision}] = \frac{[\text{# positive data points with positive predictions}]}{[\text{# all data points with positive predictions}]} = \frac{[\text{# true positives}]}{[\text{# true positives}] + [\text{# false positives}]}$$

So to keep the percentage of false positives below 3.5% of positive predictions, we must raise the precision to 96.5% or higher.
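Plugging the confusion-matrix counts from above into this definition reproduces the value that `model.evaluate` reports in the next cell; a minimal sketch in plain Python:

```python
# From the confusion matrix on the test data
tp = 26689   # true positives: predicted +1, actually +1
fp = 1443    # false positives: predicted +1, actually -1

precision = tp / float(tp + fp)
print("%.6f" % precision)  # 0.948706
```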

In [14]:
precision = model.evaluate(test_data, metric='precision')['precision']
print "Precision on test data: %s" % precision

Precision on test data: 0.948706099815

In [15]:
print "False positives as a fraction of positive predictions: %s" % (1 - precision)

False positives as a fraction of positive predictions: 0.0512939001848


A complementary metric is recall, which measures the ratio between the number of true positives and that of (ground-truth) positive reviews:

$$[\text{recall}] = \frac{[\text{# positive data points with positive predictions}]}{[\text{# all positive data points}]} = \frac{[\text{# true positives}]}{[\text{# true positives}] + [\text{# false negatives}]}$$

Let us compute the recall on the test_data.

In [16]:
recall = model.evaluate(test_data, metric='recall')['recall']
print "Recall on test data: %s" % recall

Recall on test data: 0.949955508098
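As with precision, this value follows directly from the confusion-matrix counts; a minimal sketch in plain Python:

```python
# From the confusion matrix on the test data
tp = 26689   # true positives: predicted +1, actually +1
fn = 1406    # false negatives: predicted -1, actually +1

recall = tp / float(tp + fn)
print("%.6f" % recall)  # 0.949956
```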


We first examine what happens when we use a different threshold value for making class predictions. We then explore a range of threshold values and plot the associated precision-recall curve.

## Varying the threshold¶

In [79]:
def apply_threshold(probabilities, threshold):
    return (probabilities >= threshold)
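The function behaves the same on a plain NumPy array; a quick self-contained check with made-up probabilities (the values below are illustrative only, not model output):

```python
import numpy as np

def apply_threshold(probabilities, threshold):
    # Boolean mask: True where the probability clears the threshold
    return (probabilities >= threshold)

# Hypothetical class-(+1) probabilities for five reviews
probs = np.array([0.30, 0.55, 0.90, 0.65, 0.98])

# Raising the threshold turns fewer points into positive predictions
print(apply_threshold(probs, 0.5).sum())  # 4 positive predictions
print(apply_threshold(probs, 0.9).sum())  # 2 positive predictions
```

A higher threshold makes the classifier more conservative about predicting +1, trading recall for precision; that tradeoff is exactly what the precision-recall curve below visualizes.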


## Precision-recall curve¶

Now, we will explore various threshold values, compute the precision and recall scores for each, and plot the resulting precision-recall curve.

In [85]:
threshold_values = np.linspace(0.5, 1, num=100)
print threshold_values[:5]
print threshold_values[-5:]

[ 0.5         0.50505051  0.51010101  0.51515152  0.52020202]
[ 0.97979798  0.98484848  0.98989899  0.99494949  1.        ]


For each of the values of threshold, we compute the precision and recall scores.

In [86]:
precision_all = []
recall_all = []

probabilities = model.predict(test_data, output_type='probability')
for threshold in threshold_values:
    predictions = apply_threshold(probabilities, threshold)

    precision = sum((predictions == test_data['sentiment']) * (predictions == 1)) / float(sum(predictions == 1))
    recall = sum((predictions == test_data['sentiment']) * (predictions == 1)) / float(sum(test_data['sentiment'] == 1))

    precision_all.append(precision)
    recall_all.append(recall)
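The loop above relies on GraphLab SArrays; the same precision/recall bookkeeping can be sketched with plain NumPy on a tiny made-up example (the probabilities and labels below are illustrative, not from the dataset):

```python
import numpy as np

# Hypothetical predicted probabilities and ground-truth labels (+1/-1)
probabilities = np.array([0.95, 0.80, 0.60, 0.40, 0.75, 0.55])
sentiment = np.array([+1, +1, -1, -1, +1, +1])

for threshold in (0.5, 0.7):
    predictions = np.where(probabilities >= threshold, +1, -1)
    tp = np.sum((predictions == +1) & (sentiment == +1))
    fp = np.sum((predictions == +1) & (sentiment == -1))
    fn = np.sum((predictions == -1) & (sentiment == +1))
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    print("threshold %.1f: precision %.2f, recall %.2f" % (threshold, precision, recall))
```

Raising the threshold here moves the operating point along the curve: precision rises from 0.80 to 1.00 while recall falls from 1.00 to 0.75.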


Now, let's plot the precision-recall curve to visualize the precision-recall tradeoff as we vary the threshold.

In [87]:
import matplotlib.pyplot as plt
%matplotlib inline

def plot_pr_curve(precision, recall, title):
    plt.rcParams['figure.figsize'] = 7, 5
    plt.locator_params(axis='x', nbins=5)
    plt.plot(precision, recall, 'b-', linewidth=4.0, color='#B0017F')
    plt.title(title)
    plt.xlabel('Precision')
    plt.ylabel('Recall')
    plt.rcParams.update({'font.size': 16})

plot_pr_curve(precision_all, recall_all, 'Precision recall curve (all)')


Among all the threshold values tried, what is the smallest threshold value that achieves a precision of 96.5% or better?

In [88]:
for t, p in zip(threshold_values, precision_all):
    if p >= 0.965:
        print "For a threshold value of %s, we get a precision of %s" % (t, p)
        break

For a threshold value of 0.838383838384, we get a precision of 0.965311550152
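If precision_all were a NumPy array, the same search could be done without an explicit loop; a sketch with toy stand-in values (the real curve comes from the cells above):

```python
import numpy as np

# Toy stand-ins for threshold_values and precision_all
threshold_values = np.linspace(0.5, 1, num=6)  # [0.5, 0.6, ..., 1.0]
precision_all = np.array([0.90, 0.93, 0.95, 0.966, 0.97, 1.0])

# argmax on a boolean array returns the index of the first True,
# i.e. the smallest threshold whose precision meets the target
idx = np.argmax(precision_all >= 0.965)
print("%.1f" % threshold_values[idx])  # 0.8
```

Note that np.argmax returns 0 when no element satisfies the condition, so in general one should first check `(precision_all >= 0.965).any()`.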