I'm trying to create a simple linear model in Python without using any libraries (other than numpy). Here is what I have:
import numpy as np
import pandas
np.random.seed(1)
alpha = 0.1
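def h(x, w):
    # Hypothesis: dot product of the weights and one feature row
    return np.dot(w.T, x)

def cost(X, W, Y):
    # Half the sum of squared errors over all 47 rows
    totalCost = 0
    for i in range(47):
        diff = h(X[i], W) - Y[i]
        squared = diff * diff
        totalCost += squared
    return totalCost / 2
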
housing_data = np.loadtxt('Housing.csv',delimiter=',')
x1 = housing_data[:,0]
x2 = housing_data[:,1]
y = housing_data[:,2]
avgX1 = np.mean(x1)
stdX1 = np.std(x1)
normX1 = (x1 - avgX1) / stdX1
print('avgX1',avgX1)
print('stdX1',stdX1)
avgX2 = np.mean(x2)
stdX2 = np.std(x2)
normX2 = (x2 - avgX2) / stdX2
print('avgX2',avgX2)
print('stdX2',stdX2)
normalizedX = np.ones((47,3))
normalizedX[:,1] = normX1
normalizedX[:,2] = normX2
np.savetxt('normalizedX.csv',normalizedX)
weights = np.ones((3,))
for boom in range(100):
    currentCost = cost(normalizedX, weights, y)
    if boom % 1 == 0:
        print(boom, 'iteration', weights[0], weights[1], weights[2])
        print('Cost', currentCost)
    # Update the weights one sample at a time
    for i in range(47):
        errorDiff = h(normalizedX[i], weights) - y[i]
        weights[0] = weights[0] - alpha * (errorDiff) * normalizedX[i][0]
        weights[1] = weights[1] - alpha * (errorDiff) * normalizedX[i][1]
        weights[2] = weights[2] - alpha * (errorDiff) * normalizedX[i][2]
print(weights)
predictedX = [1,(2100 - avgX1) / stdX1,(3 - avgX2) / stdX2]
firstPrediction = np.array(predictedX)
print('firstPrediction',firstPrediction)
firstPrediction = h(firstPrediction,weights)
print(firstPrediction)
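First, it converges very quickly, after only 14 iterations. Second, it gives a different result from sklearn's linear regression. For reference, my sklearn code is: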
import numpy
import matplotlib.pyplot as plot
import pandas
import sklearn.preprocessing  # import the submodule explicitly; sklearn.preprocessing is used below
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
dataset = pandas.read_csv('Housing.csv',header=None)
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,2].values
linearRegressor = LinearRegression()
xnorm = sklearn.preprocessing.scale(x)
scaleCoef = sklearn.preprocessing.StandardScaler().fit(x)
mean = scaleCoef.mean_
std = numpy.sqrt(scaleCoef.var_)
print('stf')
print(std)
stuff = linearRegressor.fit(xnorm,y)
predictedX = [[(2100 - mean[0]) / std[0],(3 - mean[1]) / std[1]]]
yPrediction = linearRegressor.predict(predictedX)
print('predictedX',predictedX)
print('predict',yPrediction)
print(stuff.coef_,stuff.intercept_)
My custom model predicts a y value of 337,000, while sklearn predicts 355,000. My data has 47 rows and looks like this:
2104,3,3.999e+05
1600,3.299e+05
2400,3.69e+05
1416,2,2.32e+05
3000,4,5.399e+05
1985,2.999e+05
1534,3.149e+05
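The full data is available at https://github.com/shamoons/linear-logistic-regression/blob/master/Housing.csv
I assume that either (a) my gradient descent regression is wrong in some way or (b) I'm not using sklearn correctly.
Are there any other reasons why the two would not predict the same output for a given input?
Best answer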
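I think you are missing the 1/m term in your gradient descent (where m is the size of y). After including the 1/m term, I seem to get a predicted value similar to the one from your sklearn code.
See below: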
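For reference, the standard batch form of the update with the 1/m factor can be written as follows. This is the textbook formulation rather than something stated in the question: $m$ is the number of rows, $x^{(i)}$ is the i-th normalized row (including the leading 1), $y^{(i)}$ is its target, and $w_j$ is the j-th weight. The `cost` function in the question divides by 2 only; dividing by $2m$ instead leaves the minimizer unchanged. Note also that the question's loop applies each term one sample at a time, so the weights change within a pass; the formula is the idealized version.

```latex
J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( w^{T} x^{(i)} - y^{(i)} \right)^{2},
\qquad
w_j \leftarrow w_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( w^{T} x^{(i)} - y^{(i)} \right) x_j^{(i)}
```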
....
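weights = np.ones((3,))
m = y.size
for boom in range(100):
    currentCost = cost(normalizedX, weights, y)
    if boom % 1 == 0:
        print(boom, 'iteration', weights[0], weights[1], weights[2])
        print('Cost', currentCost)
    for i in range(47):
        errorDiff = h(normalizedX[i], weights) - y[i]
        # Each update is now scaled by 1/m
        weights[0] = weights[0] - alpha * (1/m) * (errorDiff) * normalizedX[i][0]
        weights[1] = weights[1] - alpha * (1/m) * (errorDiff) * normalizedX[i][1]
        weights[2] = weights[2] - alpha * (1/m) * (errorDiff) * normalizedX[i][2]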
...
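This gives a first prediction of 355242.
That agrees well with the LinearRegression model, even though LinearRegression does not use gradient descent.
I also tried SGDRegressor from sklearn (which uses stochastic gradient descent), and it also seems to arrive at a value close to both the LinearRegression model and your model. See the code below: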
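One way to check this kind of agreement without sklearn is to compare gradient descent against a direct least-squares solve. The sketch below is a vectorized batch variant (not the per-sample loop above) and runs on synthetic stand-in data, not on Housing.csv; the function name `batch_gradient_descent`, the generated data, and the iteration count are all assumptions made for illustration.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for least squares, including the 1/m factor."""
    m = y.size
    weights = np.ones(X.shape[1])
    for _ in range(iterations):
        errors = X @ weights - y        # h(x_i) - y_i for every row
        gradient = (X.T @ errors) / m   # gradient averaged over all m rows
        weights = weights - alpha * gradient
    return weights

# Synthetic stand-in data (hypothetical values, not the Housing.csv rows)
rng = np.random.default_rng(1)
raw = rng.normal(size=(47, 2))
features = (raw - raw.mean(axis=0)) / raw.std(axis=0)  # same scaling as np.mean / np.std above
X = np.column_stack([np.ones(47), features])           # leading column of ones for the intercept
y = (340000 + 100000 * features[:, 0] - 5000 * features[:, 1]
     + rng.normal(scale=1000, size=47))

gd_weights = batch_gradient_descent(X, y)
exact_weights, *_ = np.linalg.lstsq(X, y, rcond=None)  # direct least-squares solution

print('gradient descent:', gd_weights)
print('least squares:   ', exact_weights)
```

On standardized features like these, the two weight vectors should agree closely after enough iterations, which mirrors the agreement reported above between the corrected loop and LinearRegression.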
import numpy
import matplotlib.pyplot as plot
import pandas
import sklearn.preprocessing  # import the submodule explicitly; sklearn.preprocessing is used below
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,SGDRegressor
dataset = pandas.read_csv('Housing.csv',header=None)
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,2].values
sgdRegressor = SGDRegressor(penalty='none',learning_rate='constant',eta0=0.1,max_iter=1000,tol = 1E-6)
xnorm = sklearn.preprocessing.scale(x)
scaleCoef = sklearn.preprocessing.StandardScaler().fit(x)
mean = scaleCoef.mean_
std = numpy.sqrt(scaleCoef.var_)
print('stf')
print(std)
yPrediction = []
predictedX = [[(2100 - mean[0]) / std[0],(3 - mean[1]) / std[1]]]
print('predictedX',predictedX)
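# Fit several times and average, since each SGD run gives a slightly different result
for trials in range(10):
    stuff = sgdRegressor.fit(xnorm,y)
    yPrediction.extend(sgdRegressor.predict(predictedX))
print('predict',numpy.mean(yPrediction))  # numpy is imported without the np alias here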
The result is:
predict 355533.10119985335