Using linear regression to predict house prices.
The original data come from the Coursera machine learning course.
When I finished the exercise in MATLAB, the idea of implementing the algorithm in Python came to me,
so I'd like to refresh the knowledge and have fun with the data :)
The algorithm is simple enough that you can scan it quickly and save your time :)
1. focus on the data
As a data scientist, the data you have determines how far you can dig beneath its surface.
First I need to load the data and keep an eye on the scheme in which it is stored.
import numpy as np

def load_data(filename):
    data = []
    with open(filename, 'rb') as f:
        for line in f:
            line = line.decode('utf-8').strip().split(',')
            data.append([int(_) for _ in line])
    return data
filename = r'ex1data2.txt'
data = load_data(filename)
# look at the first three lines of the data
print('\n'.join([str(data[i]) for i in range(3)]))
[2104, 3, 399900]
[1600, 3, 329900]
[2400, 3, 369000]
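As an aside, numpy can parse a comma-separated file like this in a single call, replacing the manual decode/strip/split loop. A minimal sketch (it rewrites the first few rows of the file so the snippet runs standalone; with the real ex1data2.txt you would skip that step):

```python
import numpy as np

# recreate a few rows in the ex1data2.txt format, for illustration only
with open('ex1data2.txt', 'w') as f:
    f.write('2104,3,399900\n1600,3,329900\n2400,3,369000\n')

# np.loadtxt reads the delimited text file straight into a float array
data = np.loadtxt('ex1data2.txt', delimiter=',')
print(data[:3])
```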
2. math model
So, what do the three integers in the first line mean?
The first element, 2104, is the size of the house in square feet; the second element, 3, is the number of bedrooms; and the last one is the price.
It is time to choose a math model to deal with the data.
Clearly, this article is focusing on linear regression.
All right, the model is linear regression.
To find the parameters, initialize the vector θ = [θ0, θ1, θ2] and minimize the squared error

    J(θ) = (1/2m) · Σ_{i=1..m} (price(x_i) − y_i)²

where price(x_i) = θᵀx_i is the predicted price. To achieve the minimization we use the gradient descent algorithm, since the cost function is convex.
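Differentiating this cost gives the vectorized update that the code below applies on every iteration:

```latex
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta^{T}x_i - y_i\bigr)^{2}
\qquad
\nabla_{\theta} J = \frac{1}{m}\,X^{T}\bigl(X\theta - y\bigr)
\qquad
\theta \leftarrow \theta - \frac{\alpha}{m}\,X^{T}\bigl(X\theta - y\bigr)
```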
talk is cheap, show me the code.
3. implement
# normalization
data = np.array(data)
x = data[:,[0,1]]
y = data[:,2]
mu = np.mean(x, axis=0)
std = np.std(x, axis=0)
x = (x-mu)/std
row = x.shape[0]
X = np.ones((row,3))
X[:,[1,2]] = x
X = np.matrix(X)
# X is now the design matrix: a column of ones plus the normalized features
theta = np.zeros((3,1))
theta = np.matrix(theta)
y = np.matrix(y)
# implement the gradient descent method
def grad_descent(X, y, theta, iter_num, alpha):
    m = len(y)
    for _ in range(iter_num):
        # vectorized update: theta := theta - (alpha/m) * X^T (X*theta - y)
        theta -= alpha / m * (X.T * X * theta - X.T * y.T)
    return theta
# initialize the parameters
iter_num = 900
alpha = 0.01
new_theta = grad_descent(X, y, theta, iter_num, alpha)
print('the theta parameter is:')
print(new_theta)
# Estimate the price of a 1650 sq-ft, 3 br house
price = np.dot(np.array([1, (1650-mu[0])/std[0], (3-mu[1])/std[1]]), new_theta)
print('for a 1650 sq-ft, 3 br house, the price is')
print(price)
the theta parameter is:
[[ 340412.65957447]
[ 109447.79646964]
[ -6578.35485416]]
for a 1650 sq-ft, 3 br house, the price is
[[ 293081.4643349]]
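It can also help to track the cost across iterations to confirm the learning rate is working. A minimal sketch of the cost from section 2, with a tiny made-up data set where a perfect fit gives zero cost:

```python
import numpy as np

def compute_cost(X, y, theta):
    # J(theta) = 1/(2m) * sum of squared residuals, vectorized
    m = len(y)
    residual = X.dot(theta) - y
    return (residual ** 2).sum() / (2 * m)

# two samples generated by y = 1 + 2*x, so theta = [1, 2] fits exactly
X = np.array([[1.0, 1.0], [1.0, 2.0]])
y = np.array([[3.0], [5.0]])
theta = np.array([[1.0], [2.0]])
print(compute_cost(X, y, theta))      # 0.0
```

Calling compute_cost inside the gradient-descent loop and printing J(θ) every few hundred iterations makes a diverging alpha obvious immediately.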
4. normal equation
When the number of features in the data is below 1000 or so, we usually use the normal equation to compute theta directly:

    θ = (XᵀX)⁻¹ Xᵀ y

What is the relationship between the two methods? The normal equation needs no learning rate and no iterations, but inverting XᵀX costs roughly O(n³), so as n grows large, gradient descent becomes the practical choice. Note that the code below uses the raw, unnormalized features, so:
# build the design matrix from the raw features
new_X = np.ones((data.shape[0], 3))
new_X[:,1:] = data[:,:2]
new_X = np.matrix(new_X)
new_theta1 = np.linalg.pinv(new_X.T*new_X)*new_X.T*y.T
print(new_theta1)
[[ 89597.90954435]
[ 139.21067402]
[ -8738.01911278]]
new_price = np.dot(np.array([1, 1650, 3]), new_theta1)
print(new_price)
[[ 293081.46433506]]
The two results are close enough.
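The two theta vectors themselves look very different only because gradient descent was fit on normalized features while the normal equation used raw ones. Undoing the normalization maps one onto the other. A self-contained sketch on synthetic data with known parameters (the closed form stands in for gradient descent here, just to obtain the normalized-scale theta):

```python
import numpy as np

# synthetic data with a known exact linear relation (illustration only)
rng = np.random.RandomState(1)
x = rng.rand(50, 2) * [2000, 5]              # fake sizes and bedroom counts
y = 50000 + 100 * x[:, 0] + 3000 * x[:, 1]

# fit on normalized features, as in the gradient-descent section
mu, std = x.mean(axis=0), x.std(axis=0)
Xn = np.hstack([np.ones((50, 1)), (x - mu) / std])
theta_n = np.linalg.pinv(Xn.T @ Xn) @ Xn.T @ y

# map back to the raw feature scale:
#   t0' = t0 - sum_j t_j * mu_j / std_j,   t_j' = t_j / std_j
raw_theta = np.array([
    theta_n[0] - theta_n[1] * mu[0] / std[0] - theta_n[2] * mu[1] / std[1],
    theta_n[1] / std[0],
    theta_n[2] / std[1],
])
print(raw_theta)   # recovers [50000, 100, 3000] up to rounding
```

Applying the same mapping to new_theta with the mu and std computed earlier reproduces new_theta1 from the normal equation.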