Mer*_*emu 7 python optimization dynamic-programming linear-regression
我正在尝试创建分段线性回归以最小化 MSE(最小平方误差),然后直接使用线性回归。该方法应使用动态规划来计算不同的分段大小和组的组合,以实现整体 MSE。我认为算法运行时是 O(n²),我想知道是否有办法将其优化为 O(nLogN)?
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
import pandas as pd
import matplotlib.pyplot as plt
x = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8,3.8]
y = [26.2, 17.8, 31.3, 23.1, 27.5, 36.5, 14.1, 22.3, 19.6, 31.3, 24.0, 17.3, 43.2, 36.4, 26.1]
dataset = np.dstack((x,y))
dataset = dataset[0]
d_arg = np.argsort(dataset[:,0])
dataset = dataset[d_arg]
def calc_error(dataset):
lr_model = linear_model.LinearRegression()
x = pd.DataFrame(dataset[:,0])
y = pd.DataFrame(dataset[:,1])
lr_model.fit(x,y)
predictions = lr_model.predict(x)
mse = mean_squared_error(y, predictions)
return mse
#n is the number of points , m is the number of groups, k is the minimum number of points in a group
#?15?5?3?returns ??3?3?3?3?3??
#?15?5?2? returns [[2,4,3,3,3],[3,2,4,2,4],[4,2,3,3,3]....]
def all_combination(n,m,k):
result = []
if n < k*m:
print('There are not enough elements to split.')
return
combination_bottom = [k for q in range(m)]
#add greedy algorithm here?
if n == k*m:
result.append(combination_bottom.copy())
else:
combination_now = [combination_bottom.copy()]
j = k*m+1
while j < n+1:
combination_last = combination_now.copy()
combination_now = []
for x in combination_last:
for i in range (0, m):
combination_new = x.copy()
combination_new[i] = combination_new[i]+1
combination_now.append(combination_new.copy())
j += 1
else:
for x in combination_last:
for i in range (0, m):
combination_new = x.copy()
combination_new[i] = combination_new[i]+1
if combination_new not in result:
result.append(combination_new.copy())
return result #2-d list
def calc_sum_error(dataset,cb):#cb = combination
mse_sum = 0
for n in range(0,len(cb)):
if n == 0:
low = 0
high = cb[0]
else:
low = 0
for i in range(0,n):
low += cb[i]
high = low + cb[n]
mse_sum += calc_error(dataset[low:high])
return mse_sum
#k is the number of points as a group
def best_piecewise(dataset,k):
lenth = len(dataset)
max_split = lenth // k
min_mse = calc_error(dataset)
split_cb = []
all_cb = []
for i in range(2, max_split+1):
split_result = all_combination(lenth, i, k)
all_cb += split_result
for cb in split_result:
tmp_mse = calc_sum_error(dataset,cb)
if tmp_mse < min_mse:
min_mse = tmp_mse
split_cb = cb
return min_mse, split_cb, all_cb
min_mse, split_cb, all_cb = best_piecewise(dataset, 2)
print('The best split of the data is '+str(split_cb))
print('The minimum MSE value is '+str(min_mse))
x = np.array(dataset[:,0])
y = np.array(dataset[:,1])
plt.plot(x,y,"o")
for n in range(0,len(split_cb)):
if n == 0:
low = 0
high = split_cb[n]
else:
low = 0
for i in range(0,n):
low += split_cb[i]
high = low + split_cb[n]
x_tmp = pd.DataFrame(dataset[low:high,0])
y_tmp = pd.DataFrame(dataset[low:high,1])
lr_model = linear_model.LinearRegression()
lr_model.fit(x_tmp,y_tmp)
y_predict = lr_model.predict(x_tmp)
plt.plot(x_tmp, y_predict, 'g-')
plt.show()
Run Code Online (Sandbox Code Playgroud)
如果我没有在任何部分说清楚,请告诉我。
我花了一些时间才意识到,您所描述的问题正是决策树回归器试图解决的问题。
不幸的是,最佳决策树的构建是 NP 难的,这意味着即使使用动态编程,您也无法将运行时间降低到 O(NlogN) 之类的任何东西。
好消息是,你可以直接使用任何维护良好的决策树实现,DecisionTreeRegressor的sklearn.tree模块,例如,可以是某些有关获得在O(NlogN)的时间复杂度最佳的性能。要强制每组最少点数,请使用min_samples_leaf参数。您还可以控制其他几个属性,例如 maximun of no。使用max_leaf_nodes,优化 wrt 不同的损失函数criterion等。
如果您想知道 Scikit-learn 的决策树与您的算法(即split_cb在您的代码中)学习的决策树相比如何:
X = np.array(x).reshape(-1,1)
dt = DecisionTreeRegressor(min_samples_leaf=MIN_SIZE).fit(X,y)
split_cb = np.unique(dt.apply(X),return_counts=True)[1]
Run Code Online (Sandbox Code Playgroud)
然后使用您使用的相同绘图代码。请注意,由于您的时间复杂度远高于 O(NlogN)*,因此您的实现通常会找到比 scikit-learn 的贪心算法更好的分割。
[1] Hyafil, L., & Rivest, RL (1976)。构建最优二元决策树是 np-complete。信息处理快报, 5(1), 15–17
*虽然我不确定您的实现的确切时间复杂度,但它肯定比 O(N^2) 差,all_combination(21,4,2)花费了 5 分钟以上。