Use linear regression to predict house prices

The original data comes from the Coursera Machine Learning course.
After I finished the exercise in MATLAB, I got the idea of implementing the algorithm in Python,
so I'd like to refresh my knowledge and have some fun with the data :)
The algorithm is simple enough that you can skim it quickly and save your time :)

1. focus on the data

As a data scientist, the data you have determines how far below its surface you can dig.
First I need to load the data and look at the format it is stored in.

import numpy as np

def load_data(filename):
    data = []
    with open(filename,'rb') as f:
        for line in f:
            line = line.decode('utf-8').strip().split(',')
            data.append([int(_) for _ in line])
    return data

filename = r'ex1data2.txt'
data = load_data(filename)
# look at the first three lines of the data
print('\n'.join([str(data[i]) for i in range(3)]))

[2104, 3, 399900]
[1600, 3, 329900]
[2400, 3, 369000]
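
As an aside, NumPy can parse a comma-separated file like this directly. Here is a minimal sketch, assuming the same three-column integer layout. The in-memory `sample` text is only a stand-in for ex1data2.txt and contains just the three rows printed above; `data_np` is a name made up for this example.

```python
import io
import numpy as np

# Stand-in for the real file: the three rows printed above.
sample = io.StringIO("2104,3,399900\n1600,3,329900\n2400,3,369000\n")

# np.loadtxt accepts a filename or a file-like object;
# delimiter=',' splits each line into columns.
data_np = np.loadtxt(sample, delimiter=',', dtype=int)

print(data_np.shape)  # (3, 3)
print(data_np[0])     # first row: size, bedrooms, price
```

With the real file you would pass the filename instead of `sample`, and the result is already a NumPy array, so the later `np.array(data)` conversion would not be needed.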

2. math model

So, what do the three integers in the first line mean?
The first element, 2104, is the size of the house in square feet, the second element, 3, is the number of bedrooms, and the last one is the price.
Now it is time to choose a math model for the data.
As the title says, this article is about linear regression,
so the model is linear regression.

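To find the parameters $\theta_0, \theta_1, \theta_2$ of the hypothesis $price(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$:

  1. Initialize the vector $\theta = [\theta_0, \theta_1, \theta_2]$.

  2. Minimize the error over the $m$ training examples $(x^{(i)}, y^{(i)})$: $error = \frac{0.5}{m} \sum_{i=1}^{m} (price(x^{(i)}) - y^{(i)})^2$

  3. To achieve the minimization we use the gradient descent algorithm, which is a safe choice here because the cost function is convex.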
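
The three steps above can be sketched on a tiny example. This is only an illustration under my own assumptions: the toy data is made up (it satisfies y = 1 + 2*x1 + 3*x2 exactly), and the helper names `compute_error` and `gradient_step` are hypothetical, not taken from the course material.

```python
import numpy as np

def compute_error(X, y, theta):
    """error = 0.5/m * sum((X @ theta - y)**2); X has a leading column of ones."""
    m = X.shape[0]
    residual = X @ theta - y
    return 0.5 / m * float(residual @ residual)

def gradient_step(X, y, theta, alpha):
    """One gradient descent update: theta - alpha/m * X^T (X theta - y)."""
    m = X.shape[0]
    return theta - alpha / m * (X.T @ (X @ theta - y))

# Toy data (made up): y = 1 + 2*x1 + 3*x2 with no noise.
X_toy = np.array([[1., 0., 0.],
                  [1., 1., 0.],
                  [1., 0., 1.],
                  [1., 1., 1.]])
y_toy = np.array([1., 3., 4., 6.])

theta_toy = np.zeros(3)                      # step 1: initialize theta
error_before = compute_error(X_toy, y_toy, theta_toy)
for _ in range(5000):                        # steps 2 and 3: minimize by gradient descent
    theta_toy = gradient_step(X_toy, y_toy, theta_toy, alpha=0.1)
error_after = compute_error(X_toy, y_toy, theta_toy)

print(error_before)  # 7.75
print(theta_toy)     # close to [1, 2, 3]
```

Because the toy data is noise-free, the error shrinks towards zero and theta approaches the coefficients used to generate it. On real data the error settles at some positive minimum instead.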

talk is cheap, show me the code.

3. implementation

# normalization
data = np.array(data)

x = data[:,[0,1]]
y = data[:,2]

mu = np.mean(x, axis=0)
std = np.std(x, axis=0)

x = (x-mu)/std

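# build the design matrix X: a column of ones for the intercept,
# followed by the two normalized features
row = x.shape[0]
X = np.ones((row,3))
X[:,[1,2]] = x
X = np.matrix(X)
# start from theta = 0 (a 3 x 1 column)
theta = np.zeros((3,1))
theta = np.matrix(theta)
# note: np.matrix turns the 1-D y into a 1 x m row, hence the y.T used later
y = np.matrix(y)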
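# gradient descent: theta <- theta - alpha/m * (X^T X theta - X^T y)
def grad_descent(X, y, theta, iter_num, alpha):
    # caution: y is a 1 x m row matrix, so len(y) is 1 here, not the number
    # of samples. The step is therefore alpha times the summed gradient.
    # Using m = X.shape[0] gives the averaged form, but then alpha = 0.01
    # would need more than 900 iterations to converge as closely.
    m = len(y)
    for _ in range(iter_num):
        theta -= alpha/m*(X.T*X*theta-X.T*y.T)  # updates theta in place
    return theta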

# initialize the parameters
iter_num = 900
alpha = 0.01

new_theta = grad_descent(X, y, theta, iter_num, alpha)
print('the theta parameter is:')
print(new_theta)
# Estimate the price of a 1650 sq-ft, 3 br house
price = np.dot(np.array([1, (1650-mu[0])/std[0], (3-mu[1])/std[1]]), new_theta)
print('for a 1650 sq-ft, 3 br house,the price is')
print(price)

the theta parameter is:
[[ 340412.65957447]
 [ 109447.79646964]
 [  -6578.35485416]]
for a 1650 sq-ft, 3 br house,the price is
[[ 293081.4643349]]

4. Normal Equation

When the number of features is small (say, below about 1000),
we can usually use the normal equation to compute theta directly.

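What is the relationship between these two methods?

$\theta_{n+1} = \theta_n - \frac{\alpha}{m}(X^T X \theta_n - X^T y)$

As $n \to \infty$ the iteration converges, so $\theta_{n+1} = \theta_n = \theta$ and $X^T X \theta - X^T y = 0$

so $\theta = (X^T X)^{-1} X^T y$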
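
The fixed-point argument can be checked numerically. This is a small sketch on made-up random data (the names `X_demo`, `y_demo`, `theta_ne` and `grad` are mine): it solves the normal equation and confirms that the gradient term vanishes there. I use `np.linalg.solve` instead of an explicit inverse, which is a common choice but not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 20 samples, an intercept column plus 2 random features.
X_demo = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y_demo = rng.normal(size=20)

# Solve (X^T X) theta = X^T y without forming the inverse explicitly.
theta_ne = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# At this theta the term X^T X theta - X^T y is zero (up to rounding),
# so a gradient descent update would leave theta unchanged.
grad = X_demo.T @ X_demo @ theta_ne - X_demo.T @ y_demo
print(np.allclose(grad, 0.0))  # True
```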

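# normal equation on the raw (unnormalized) features
new_X = np.ones((data.shape[0],3))
new_X[:,1:] = data[:,:2]
new_X = np.matrix(new_X)
# pinv is the pseudo-inverse, which also copes with a singular X^T X
new_theta1 = np.linalg.pinv(new_X.T*new_X)*new_X.T*y.T
print(new_theta1)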

[[ 89597.90954435]
 [   139.21067402]
 [ -8738.01911278]]



new_price = np.dot(np.array([1, 1650, 3]), new_theta1)
print(new_price)

[[ 293081.46433506]]

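The two predicted prices agree to several decimal places, even though the theta values differ: gradient descent was fit on normalized features, while the normal equation used the raw ones.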
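
To compare the theta vectors themselves, the coefficients fitted on normalized features can be mapped back to raw units: substituting (x - mu) / std into the hypothesis gives a slope of theta_i / std_i for each feature and an intercept of theta_0 minus the sum of theta_i * mu_i / std_i. Below is a minimal sketch of that algebra. The helper `to_raw_theta` is hypothetical and the numbers are made up to keep the arithmetic easy; they are not the values from the dataset.

```python
import numpy as np

def to_raw_theta(theta_norm, mu, std):
    """Convert [t0, t1, t2] fitted on (x - mu) / std into coefficients for raw x."""
    theta_norm = np.asarray(theta_norm, dtype=float).ravel()
    slopes = theta_norm[1:] / std
    intercept = theta_norm[0] - np.sum(slopes * mu)
    return np.concatenate([[intercept], slopes])

# Made-up numbers, only to check the algebra.
mu_demo = np.array([2000.0, 3.0])
std_demo = np.array([800.0, 0.75])
theta_norm_demo = np.array([340000.0, 110000.0, -6000.0])

theta_raw_demo = to_raw_theta(theta_norm_demo, mu_demo, std_demo)

# Predict for the same house in both parameterizations.
x_raw = np.array([1650.0, 3.0])
pred_norm = theta_norm_demo[0] + theta_norm_demo[1:] @ ((x_raw - mu_demo) / std_demo)
pred_raw = theta_raw_demo[0] + theta_raw_demo[1:] @ x_raw

print(theta_raw_demo)                   # intercept 89000, slopes 137.5 and -8000
print(np.isclose(pred_norm, pred_raw))  # True
```

Applied to `new_theta`, `mu` and `std` from the gradient descent section, the same conversion should give values comparable to `new_theta1`, assuming gradient descent has converged.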