Multiple Regression with Gradient Descent

Load packages, data, functions

In [6]:
import graphlab
import numpy as np 
In [5]:
sales = graphlab.SFrame('kc_house_data.gl/')

Now we will write a function that will accept an SFrame, a list of feature names (e.g. ['sqft_living', 'bedrooms']) and a target feature (e.g. 'price'), and will return two things:

  • A numpy matrix whose columns are the desired features plus a constant column (this is how we create an 'intercept')
  • A numpy array containing the values of the output
In [7]:
# All features in matrix, output in array
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 # add constant variable
    features = ['constant'] + features # create list of features
    features_sframe = data_sframe[features] # create sframe with all features
    feature_matrix = features_sframe.to_numpy() # convert to np matrix
    
    output_sarray = data_sframe[output] # select the output variable
    output_array = output_sarray.to_numpy() # convert to np array
    return(feature_matrix, output_array)
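
A quick illustrative check of this function (the example_* names here are only for demonstration and are not part of the original cells): with a single feature, the matrix should have two columns, the constant and the feature itself, and the output array should have one entry per house.

(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price')
print example_features.shape  # (number of houses, 2): constant column plus sqft_living
print example_output.shape    # (number of houses,): one price per house
print example_features[0]     # first row: [1.0, sqft_living of the first house]
print example_output[0]       # price of the first house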

Computing the Derivative

We are now going to move to computing the derivative of the regression cost function. Recall that the cost function is the sum over the data points of the squared difference between an observed output and a predicted output.

Since the derivative of a sum is the sum of the derivatives we can compute the derivative for a single data point and then sum over data points. We can write the squared difference between the observed output and predicted output for a single point as follows:

(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i]*[feature_i] + ... + w[k]*[feature_k] - output)^2

Where we have k features and a constant. So the derivative with respect to weight w[i] by the chain rule is:

2*(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i]*[feature_i] + ... + w[k]*[feature_k] - output)*[feature_i]

The term inside the parentheses is just the error (the difference between the prediction and the output). So we can re-write this as:

2*error*[feature_i]

That is, the derivative for the weight of feature i is the sum (over data points) of 2 times the product of the error and the feature value. In the case of the constant (a column of 1's), this is just twice the sum of the errors!

Recall that twice the sum of the product of two vectors is just twice the dot product of the two vectors. Therefore the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors.

With this in mind, complete the following derivative function, which computes the derivative of a weight given the values of the corresponding feature (over all data points) and the errors (over all data points).

In [11]:
def feature_derivative(errors, feature):
    # twice the dot product of the errors with the feature values (over all data points)
    derivative = 2 * np.dot(errors, feature)
    return(derivative)
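
As a quick sanity check (this cell is illustrative and not part of the original assignment; the example_* names are made up here): if all weights are zero, the predictions are zero and the errors are just the negative of the output, so the derivative with respect to the constant should equal -2 times the sum of the output.

(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price')
example_weights = np.array([0., 0.])                               # all-zero weights
example_predictions = np.dot(example_features, example_weights)    # all zeros
example_errors = example_predictions - example_output              # equals -output
print feature_derivative(example_errors, example_features[:, 0])   # derivative w.r.t. the constant
print -2 * example_output.sum()                                    # should match the line above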

Gradient Descent

Now we will write a function that performs a gradient descent. The basic premise is simple. Given a starting point we update the current weights by moving in the negative gradient direction. Recall that the gradient is the direction of increase and therefore the negative gradient is the direction of decrease and we're trying to minimize a cost function.

The amount by which we move in the negative gradient direction is called the 'step size'. We stop when we are 'sufficiently close' to the optimum, which we define by requiring the magnitude (length) of the gradient vector to be smaller than a fixed 'tolerance'.

With this in mind, complete the following gradient descent function using your derivative function above. For each step of the gradient descent we update the weight for each feature before checking the stopping criterion.

In [12]:
from math import sqrt # recall that the magnitude/length of a vector [g[0], g[1], g[2]] is sqrt(g[0]^2 + g[1]^2 + g[2]^2)
In [13]:
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False 
    weights = np.array(initial_weights, dtype=float) # make sure it's a float numpy array (integer weights would be truncated by the in-place update)
    while not converged:
        predictions = np.dot(feature_matrix, weights)
        errors = predictions - output
        gradient_sum_squares = 0 # initialize the gradient sum of squares
        for i in range(len(weights)): # loop over each weight
            derivative = feature_derivative(errors, feature_matrix[:, i])
            gradient_sum_squares += derivative ** 2
            weights[i] -= step_size * derivative
        gradient_magnitude = sqrt(gradient_sum_squares)
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)

A few things to note before we run the gradient descent. Since the gradient is a sum over all the data points of the product of an error and a feature value, and both the features (square feet) and the output (prices) are large numbers, the gradient itself will be very large. So while you might expect "tolerance" to be small, 'small' is only relative to the size of the features.

For similar reasons the step size will be much smaller than you might expect, but only because the gradient has such large values.

Simple Regression

First let's split the data into training and test data.

In [14]:
train_data,test_data = sales.random_split(.8,seed=0)

Although the gradient descent function is designed for multiple regression, the constant is itself a feature, so we can use it to estimate the parameters of a simple regression on square feet. The following cell sets up the feature_matrix, output, initial weights and step size for the first model:

In [15]:
# let's test out the gradient descent
simple_features = ['sqft_living']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
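
To see why the tolerance needs to be this large (this check is an optional illustration, not part of the original assignment; the start_* names are made up here), you can compute the magnitude of the gradient at the initial weights using the variables defined above:

start_errors = np.dot(simple_feature_matrix, initial_weights) - output
start_gradient = np.array([feature_derivative(start_errors, simple_feature_matrix[:, i])
                           for i in range(len(initial_weights))])
print np.sqrt((start_gradient ** 2).sum())  # many orders of magnitude larger than the tolerance of 2.5e7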

Next run your gradient descent with the above parameters.

In [16]:
simple_weights = regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, tolerance)
simple_weights[1]
Out[16]:
281.91211911641625

Use your newly estimated weights and your predict_output() function to compute the predictions on all the TEST data (you will need to create a numpy array of the test feature_matrix and test output first):
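
Note that predict_output() is referenced here but never defined in the cells above; a minimal definition, consistent with how predictions are computed inside regression_gradient_descent (the dot product of the feature matrix with the weight vector), would be:

def predict_output(feature_matrix, weights):
    # predicted values are the dot product of the feature matrix with the weights
    predictions = np.dot(feature_matrix, weights)
    return(predictions)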

In [18]:
(test_simple_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)
testpred = predict_output(test_simple_feature_matrix, simple_weights)

Now that you have the predictions on test data, compute the RSS on the test data set. Save this value for comparison later. Recall that RSS is the sum of the squared errors (difference between prediction and output).

In [19]:
rss1 = ((testpred - test_output)**2).sum()
print rss1
2.75400047593e+14

Multiple Regression

Now we will use more than one actual feature. Use the following code to produce the weights for a second model with the following parameters:

In [20]:
model_features = ['sqft_living', 'sqft_living15']
my_output = 'price'
(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9

Use the above parameters to estimate the model weights. Record these values for your quiz.

In [21]:
model_weights = regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance)

Use your newly estimated weights and the predict_output function to compute the predictions on the TEST data. Don't forget to create a numpy array for these features from the test set first!

In [23]:
(test_feature_matrix, test_output) = get_numpy_data(test_data, model_features, my_output)
predictions2 = predict_output(test_feature_matrix, model_weights)

Now use your predictions and the output to compute the RSS for model 2 on TEST data.

In [24]:
rss2 = ((predictions2 - test_output)**2).sum()
print rss2
2.70263446465e+14
In [29]:
print "Simple model RSS: %s, Multiple model RSS: %s" % (rss1, rss2)
Simple model RSS: 2.75400047593e+14, Multiple model RSS: 2.70263446465e+14