Ced*_*olo 6 python numpy pandas ransac scikit-learn
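I am trying to detect continuous spans in which the relevant variable changes linearly within some data in a DataFrame. The data may contain many spans that satisfy this requirement. I started my approach with ransac, based on the "Robust linear model estimation using RANSAC" example. However, I am having trouble applying that example to my data.
The goal is to detect the continuous spans in which the relevant variable changes linearly within the data. A span to be detected consists of more than 20 consecutive data points. The desired output is the date ranges over which the continuous spans are located.
In the toy example code below, I generate random data and then set two sections of the data to create continuous spans that vary linearly. I then try to fit a linear regression model to the data. The rest of the code I use (not shown here) is just the remaining code from the "Robust linear model estimation using RANSAC" page. I know, however, that I will need to change that remaining code to reach the goal.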
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
import numpy as np
## 1. Generate random data for toy sample
times = pd.date_range('2016-08-10', periods=100, freq='15min')
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), index=times, columns=["data"])
## 2. Set line1 within random data
date_range1_start = "2016-08-10 08:15"
date_range1_end = "2016-08-10 15:00"
line1 = df.data[date_range1_start:date_range1_end]
value_start1 = 10
values1 = range(value_start1,value_start1+len(line1))
df.loc[date_range1_start:date_range1_end, "data"] = list(values1)
## 3. Set line2 within random data
date_range2_start = "2016-08-10 17:00"
date_range2_end = "2016-08-10 22:30"
value_start2 = 90
line2 = df.data[date_range2_start:date_range2_end]
values2 = range(value_start2,value_start2-len(line2),-1)
df.loc[date_range2_start:date_range2_end, "data"] = list(values2)
## 4. Plot data
df.plot()
plt.show()
## 5. Create arrays
X = np.asarray(df.index)
y = np.asarray(df.data.tolist())
## 6. Fit line using all data
lr = linear_model.LinearRegression()
lr.fit(X, y)
For this toy example code, the desired output (which I have not been able to code yet) would be a DataFrame like this:
>>> out
start end
0 2016-08-10 08:15 2016-08-10 15:00
1 2016-08-10 17:00 2016-08-10 22:30
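However, when step 6 is executed I get the following error:
ValueError: Expected 2D array, got 1D array instead: ... Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I would like to be able to detect, in this example, the spans in which the relevant variable changes linearly (line1 and line2). But I cannot get the approach described in the ransac code example to work.
What should I modify in my code to be able to continue? And is there a better way to detect the continuous spans in which the relevant variable changes linearly?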
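To answer the question about the ValueError: you get the error and the example does not because the example passes a 2D X, whereas your X = np.asarray(df.index) has shape (100,). scikit-learn requires X to be 2D, while a 1D y such as the one built from df.data.tolist() is fine. This can be fixed by reshaping X to 2D with X = X.reshape(-1,1). The next error will be that the X values cannot be in datetime64 format. That can be fixed by converting the times to seconds. For example, a standard epoch to use is 1970-01-01T00:00Z, so that every data point becomes the number of seconds since that date and time. The conversion can be done with: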
X = (X - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
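As a quick check of the two fixes just described, the following minimal sketch converts a DatetimeIndex to seconds and reshapes it. The variable name `seconds` is only illustrative, and the epoch string is written without the trailing `Z` on the assumption that a recent NumPy version is in use, since those deprecate timezone-aware parsing.

```python
import numpy as np
import pandas as pd

times = pd.date_range("2016-08-10", periods=100, freq="15min")

# Seconds since the Unix epoch as floats (epoch written without a trailing "Z").
seconds = (np.asarray(times) - np.datetime64("1970-01-01T00:00:00")) / np.timedelta64(1, "s")

# scikit-learn expects X with shape (n_samples, n_features).
X = seconds.reshape(-1, 1)

print(seconds.shape, X.shape)  # (100,) (100, 1)
print(X[1, 0] - X[0, 0])       # 900.0, i.e. 15 minutes
```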
Here is the full code, which plots the linear fit:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
import numpy as np
## 1. Generate random data for toy sample
times = pd.date_range('2016-08-10', periods=100, freq='15min')
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), index=times, columns=["data"])
## 2. Set line1 within random data
date_range1_start = "2016-08-10 08:15"
date_range1_end = "2016-08-10 15:00"
line1 = df.data[date_range1_start:date_range1_end]
value_start1 = 10
values1 = range(value_start1,value_start1+len(line1))
df.loc[date_range1_start:date_range1_end, "data"] = list(values1)
## 3. Set line2 within random data
date_range2_start = "2016-08-10 17:00"
date_range2_end = "2016-08-10 22:30"
value_start2 = 90
line2 = df.data[date_range2_start:date_range2_end]
values2 = range(value_start2,value_start2-len(line2),-1)
df.loc[date_range2_start:date_range2_end, "data"] = list(values2)
## 4. Create arrays
X = np.asarray(df.index)
X = (X - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
X = X.reshape(-1,1)
y = np.asarray(df.data.tolist())
## 5. Fit line using all data
lr = linear_model.LinearRegression()
lr.fit(X, y)
## 6. Predict values
z = lr.predict(X)
df['linear fit'] = z
## 7. Plot
df.plot()
plt.show()
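As you said, RANSAC is a good method for detecting spans of linear data. To use it, the linear model is changed to lr = linear_model.RANSACRegressor(). However, this returns only one span, whereas you need to detect all of them. That means you have to repeat the span detection, removing each span after it has been detected so that it is not detected again. This should be repeated until the number of points in the detected span is less than 20.
The residual threshold of the RANSAC fit needs to be very small so that points outside the span are not picked up. The residual_threshold can be changed if there is any noise in the real data. Even so, this is not always enough, and false inliers may be found, which will affect the recorded span ranges.
Because RANSAC does not check whether the points of a span are consecutive, outliers can be wrongly included in a span. To guard against this, points flagged as being in the span should be changed to outliers if they are surrounded by outliers. The quickest way to do this is to convolve lr.inlier_mask_ with [1,1,1]. After the convolution, any lone "inlier" will have a value of 1 (and is therefore really an outlier), whereas points that are part of a run will be 2 or 3. The following therefore fixes the false inliers: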
lr.inlier_mask_ = np.convolve(lr.inlier_mask_.astype(int), [1,1,1], mode='same') > 1
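To see what the convolution does, here is a small sketch on a made-up mask (the mask values are hypothetical and not taken from the toy data): lone inliers sit at positions 0 and 8, and a genuine run at positions 3 to 5.

```python
import numpy as np

# Hypothetical inlier mask: lone inliers at 0 and 8, a run at 3..5.
mask = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1], dtype=bool)

# Each entry becomes the number of inliers among itself and its two neighbours.
counts = np.convolve(mask.astype(int), [1, 1, 1], mode="same")
cleaned = counts > 1

print(counts)               # [1 1 1 2 3 2 1 1 1]
print(cleaned.astype(int))  # [0 0 0 1 1 1 0 0 0]
```

Note that a single outlier flanked by two inliers also gets a count of 2, so it is relabelled as an inlier. If that is not wanted, `(counts > 1) & mask` keeps only points that were inliers to begin with.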
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
import numpy as np
## 1. Generate random data for toy sample
times = pd.date_range('2016-08-10', periods=100, freq='15min')
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), index=times, columns=["data"])
## 2. Set line1 within random data
date_range1_start = "2016-08-10 08:15"
date_range1_end = "2016-08-10 15:00"
line1 = df.data[date_range1_start:date_range1_end]
value_start1 = 10
values1 = range(value_start1,value_start1+len(line1))
df.loc[date_range1_start:date_range1_end, "data"] = list(values1)
## 3. Set line2 within random data
date_range2_start = "2016-08-10 17:00"
date_range2_end = "2016-08-10 22:30"
value_start2 = 90
line2 = df.data[date_range2_start:date_range2_end]
values2 = range(value_start2,value_start2-len(line2),-1)
df.loc[date_range2_start:date_range2_end, "data"] = list(values2)
## 4. Create arrays
X = np.asarray(df.index)
X = (X - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
X = X.reshape(-1,1)
y = np.asarray(df.data.tolist())
## 5. Fit line using all data
lr = linear_model.RANSACRegressor(residual_threshold=0.001)
lr.fit(X, y)
# Placeholders for start/end times
start_times = []
end_times = []
# Repeat fit and check if number of span inliers is greater than 20
while np.sum(lr.inlier_mask_) > 20:
    # Remove false inliers
    lr.inlier_mask_ = np.convolve(lr.inlier_mask_.astype(int), [1,1,1], mode='same') > 1
    # Store start/end times
    in_span = np.squeeze(np.where(lr.inlier_mask_))
    start_times.append(str(times[in_span[0]]))
    end_times.append(str(times[in_span[-1]]))
    # Get outlier and check for another span
    outliers = np.logical_not(lr.inlier_mask_)
    X = X[outliers]
    y = y[outliers]
    times = times[outliers]
    # Fit to remaining points
    lr.fit(X, y)
out = pd.DataFrame({'start':start_times, 'end':end_times}, columns=['start','end'])
# sort_values returns a new DataFrame, so keep the result
out = out.sort_values('start').reset_index(drop=True)
This gives the out DataFrame.
You can also plot the spans to verify them:
plt.plot(df['data'],c='b')
for idx,row in out.iterrows():
    # start/end were stored as strings, so convert them back to timestamps
    x0 = pd.Timestamp(row['start'])
    y0 = df.loc[x0, 'data']
    x1 = pd.Timestamp(row['end'])
    y1 = df.loc[x1, 'data']
    plt.plot([x0,x1],[y0,y1],c='r')
plt.show()
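The detection loop in this answer records the first and last inlier position, which assumes the cleaned mask holds a single run. If it still holds more than one, one possible refinement is to keep only the longest run. The sketch below is an assumption layered on top of the answer: `longest_run` is a hypothetical helper, not part of scikit-learn, and its positions refer to the array that was passed to `fit`, so after earlier spans have been removed, neighbouring positions are not necessarily neighbouring in time.

```python
import numpy as np

def longest_run(mask):
    """Return inclusive (start, stop) positions of the longest run of True, or None."""
    # Pad with zeros so runs touching either end still produce a rising and falling edge.
    padded = np.concatenate(([0], np.asarray(mask, dtype=int), [0]))
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1) - 1
    if starts.size == 0:
        return None
    best = np.argmax(stops - starts)
    return int(starts[best]), int(stops[best])

print(longest_run([0, 1, 1, 0, 1, 1, 1]))  # (4, 6)
```

Inside the loop, the returned pair could replace `in_span[0]` and `in_span[-1]`, with the inlier mask narrowed to that run before the outliers are computed.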
小智 2
To just go on and fit your linear regression, you will have to do the following:
lr.fit(X.reshape(-1,1), y)
This is because scikit-learn expects a 2D array of values, with each row holding the features of one sample.
So after this would you like to fit models for many different ranges and see if you find spans of linear change?
If you are looking for exactly linear ranges (which is possible to detect in the case of integers for example, but not for floats), then I would do something like:
dff = df.diff()
dff['block'] = (dff.data.shift(1) != dff.data).astype(int).cumsum()
out = pd.DataFrame(list(
    dff.reset_index().groupby('block')['index']
       .apply(lambda x: [x.min(), x.max()] if len(x) > 20 else None)
       .dropna()
))
Output would be:
>>> out
0 1
0 2016-08-10 08:30:00 2016-08-10 15:00:00
1 2016-08-10 17:15:00 2016-08-10 22:30:00
If you are trying to do something similar for float data, I would use diff the same way but allow some acceptable error. Please let me know if this is what you would like to achieve. You could certainly also use RANSAC on different ranges, but it just discards the points that are not well aligned, so if some element broke the span you would still detect it as a span. Everything depends on what exactly you are interested in.
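For the float case, a minimal sketch of the diff-with-tolerance idea could look like the following. The function name `linear_spans` and the parameters `tol` and `min_points` are assumptions made for illustration. As in the integer version above, the first point of each span is not included, because its difference from the preceding point does not match the rest of the span. Comparing only consecutive steps also means a slow drift in slope can go unnoticed.

```python
import numpy as np
import pandas as pd

def linear_spans(series, tol=1e-6, min_points=20):
    """Find runs where consecutive differences agree to within `tol`."""
    step = series.diff()
    # A new block starts wherever the step differs from the previous step.
    changed = ~np.isclose(step.to_numpy(), step.shift(1).to_numpy(), atol=tol, rtol=0.0)
    block = np.cumsum(changed)
    rows = []
    for _, idx in series.index.to_series().groupby(block):
        if len(idx) > min_points:
            rows.append((idx.iloc[0], idx.iloc[-1]))
    return pd.DataFrame(rows, columns=["start", "end"])

# Usage on made-up float data: a linear stretch at positions 10..39.
idx = pd.date_range("2016-08-10", periods=60, freq="15min")
vals = np.array([((i * 37) % 11) ** 2 for i in range(60)], dtype=float)
vals[10:40] = 0.5 * np.arange(30) + 3.0
print(linear_spans(pd.Series(vals, index=idx), tol=1e-9))
```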