In [30]:

```
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
```

In [6]:

```
import graphlab
import numpy as np
```

In [5]:

```
sales = graphlab.SFrame('kc_house_data.gl/')
```

Now we will write a function that will accept an SFrame, a list of feature names (e.g. ['sqft_living', 'bedrooms']) and an target feature e.g. ('price') and will return two things:

- A numpy matrix whose columns are the desired features plus a constant column (this is how we create an 'intercept')
- A numpy array containing the values of the output

In [7]:

```
# All features in matrix, output in array
def get_numpy_data(data_sframe, features, output):
data_sframe['constant'] = 1 # add constant variable
features = ['constant'] + features # create list of features
features_sframe = data_sframe[features] # create sframe with all features
feature_matrix = features_sframe.to_numpy() # convert to np matrix
output_sarray = data_sframe[output] # select the output variable
output_array = output_sarray.to_numpy() # convert to np array
return(feature_matrix, output_array)
```

We are now going to move to computing the derivative of the regression cost function. Recall that the cost function is the sum over the data points of the squared difference between an observed output and a predicted output.

Since the derivative of a sum is the sum of the derivatives we can compute the derivative for a single data point and then sum over data points. We can write the squared difference between the observed output and predicted output for a single point as follows:

(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i] *[feature_i] + ... + w[k]*[feature_k] - output)^2

Where we have k features and a constant. So the derivative with respect to weight w[i] by the chain rule is:

2*(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i] *[feature_i] + ... + w[k]*[feature_k] - output)* [feature_i]

The term inside the paranethesis is just the error (difference between prediction and output). So we can re-write this as:

2*error*[feature_i]

That is, the derivative for the weight for feature i is the sum (over data points) of 2 times the product of the error and the feature itself. In the case of the constant then this is just twice the sum of the errors!

Recall that twice the sum of the product of two vectors is just twice the dot product of the two vectors. Therefore the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors.

With this in mind complete the following derivative function which computes the derivative of the weight given the value of the feature (over all data points) and the errors (over all data points).

In [11]:

```
def feature_derivative(errors, feature):
derivative = 2 * np.dot(errors, feature)
return(derivative)
```

Now we will write a function that performs a gradient descent. The basic premise is simple. Given a starting point we update the current weights by moving in the negative gradient direction. Recall that the gradient is the direction of *increase* and therefore the negative gradient is the direction of *decrease* and we're trying to *minimize* a cost function.

The amount by which we move in the negative gradient *direction* is called the 'step size'. We stop when we are 'sufficiently close' to the optimum. We define this by requiring that the magnitude (length) of the gradient vector to be smaller than a fixed 'tolerance'.

With this in mind, complete the following gradient descent function below using your derivative function above. For each step in the gradient descent we update the weight for each feature befofe computing our stopping criteria

In [12]:

```
from math import sqrt # recall that the magnitude/length of a vector [g[0], g[1], g[2]] is sqrt(g[0]^2 + g[1]^2 + g[2]^2)
```

In [13]:

```
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
converged = False
weights = np.array(initial_weights) # make sure it's a numpy array
while not converged:
predictions = np.dot(feature_matrix, weights)
errors = predictions - output
gradient_sum_squares = 0 # initialize the gradient sum of squares
for i in range(len(weights)): # loop over each weight
derivative = feature_derivative(errors, feature_matrix[:, i])
gradient_sum_squares += derivative ** 2
weights[i] -= step_size * derivative
gradient_magnitude = sqrt(gradient_sum_squares)
if gradient_magnitude < tolerance:
converged = True
return(weights)
```

A few things to note before we run the gradient descent. Since the gradient is a sum over all the data points and involves a product of an error and a feature the gradient itself will be very large since the features are large (squarefeet) and the output is large (prices). So while you might expect "tolerance" to be small, small is only relative to the size of the features.

For similar reasons the step size will be much smaller than you might expect but this is because the gradient has such large values.

First let's split the data into training and test data.

In [14]:

```
train_data,test_data = sales.random_split(.8,seed=0)
```

In [15]:

```
# let's test out the gradient descent
simple_features = ['sqft_living']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
```

Next run your gradient descent with the above parameters.

In [16]:

```
simple_weights = regression_gradient_descent(simple_feature_matrix, output,initial_weights, step_size,tolerance)
simple_weights[1]
```

Out[16]:

In [18]:

```
(test_simple_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)
testpred = predict_output(test_simple_feature_matrix, simple_weights)
```

In [19]:

```
rss1 = ((testpred - test_output)**2).sum()
print rss1
```

In [20]:

```
model_features = ['sqft_living', 'sqft_living15']
my_output = 'price'
(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9
```

Use the above parameters to estimate the model weights. Record these values for your quiz.

In [21]:

```
model_weights = regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance)
```

In [23]:

```
(test_feature_matrix, test_output) = get_numpy_data(test_data, model_features, my_output)
predictions2 = predict_output(test_feature_matrix, model_weights)
```

Now use your predictions and the output to compute the RSS for model 2 on TEST data.

In [24]:

```
rss2 = ((predictions2 - test_output)**2).sum()
print rss2
```

In [29]:

```
print "Simple model RSS: %s, Multiple model RSS: %s" % (rss1, rss2)
```