Jef*_*ist 7 python numpy pandas
I have a pandas DataFrame "df", a sample of which looks like this:
   time    x
0     1    1
1     2  NaN
2     3    3
3     4  NaN
4     5    8
5     6    7
6     7    5
7     8  NaN
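For anyone who wants to try the answers, the sample above can be rebuilt roughly like this. It is only a sketch: the values are copied from the table, and the float dtype of "x" is an assumption, since the real frame is not shown.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the table above; NaN marks the missing values.
df = pd.DataFrame({
    "time": [1, 2, 3, 4, 5, 6, 7, 8],
    "x": [1, np.nan, 3, np.nan, 8, 7, 5, np.nan],
})

print(df)
```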
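The real frame is much larger. I am trying to find the longest run of non-NaN values in the "x" series and print the start and end index of that run. Is this possible?
Thanks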
Here is a vectorized approach using NumPy tools:
import numpy as np

a = df.x.values  # Extract the relevant column from the dataframe as an array
m = np.concatenate(([True], np.isnan(a), [True]))  # NaN mask, padded with True at both ends
ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1, 2)  # Start-stop limits of each non-NaN run
start, stop = ss[(ss[:, 1] - ss[:, 0]).argmax()]  # Limits of the longest interval
Sample run:
In [474]: a
Out[474]:
array([ 1., nan, 3., nan, nan, nan, nan, 8., 7., 5., 2.,
5., nan, nan])
In [475]: start, stop
Out[475]: (7, 12)
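The intervals are set up so that the difference between each start and stop gives the length of that interval, which means stop is exclusive. So if by "ending index" you mean the index of the last non-NaN element, subtract one from stop.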
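To make the inclusive/exclusive point concrete, the same steps can be wrapped in a small helper that returns the last valid position directly. This is a sketch, not part of the original answer: the name longest_valid_run and the choice to return None when there are no valid values are assumptions.

```python
import numpy as np
import pandas as pd

def longest_valid_run(values):
    """Return (first, last) positions of the longest non-NaN run, both inclusive."""
    a = np.asarray(values, dtype=float)
    # Pad with True so runs touching either end still produce a boundary.
    m = np.concatenate(([True], np.isnan(a), [True]))
    # Each row holds the start and the exclusive stop of one non-NaN run.
    ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1, 2)
    if len(ss) == 0:
        return None  # assumption: no valid values means there is no run to report
    start, stop = ss[(ss[:, 1] - ss[:, 0]).argmax()]
    return int(start), int(stop) - 1  # subtract one to make the end inclusive

df = pd.DataFrame({"time": range(1, 9),
                   "x": [1, np.nan, 3, np.nan, 8, 7, 5, np.nan]})
first, last = longest_valid_run(df["x"].to_numpy())
print(first, last)                       # positions: 4 6
print(df.index[first], df.index[last])   # matching index labels
```

The positions are array offsets, so going through df.index translates them to labels when the frame does not use a default integer index.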
pandas
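import pandas as pd

f = dict(
    Start=pd.Series.first_valid_index,
    Stop=pd.Series.last_valid_index,
    Stretch='count'
)
# Each NaN starts a new group, so every group holds one run of valid values.
# Unpack f as named aggregations; passing a renaming dict to agg on a
# Series groupby was removed in pandas 1.0.
agged = df.x.groupby(df.x.isnull().cumsum()).agg(**f)
agged.loc[agged.Stretch.idxmax(), ['Start', 'Stop']].values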
array([ 4., 6.])
numpy
import numpy as np

def pir(x):
    # pad with np.nan
    x = np.append(np.nan, np.append(x, np.nan))
    # find where null
    w = np.where(np.isnan(x))[0]
    # diff to find length of stretch
    # argmax to find where largest stretch
    a = np.diff(w).argmax()
    # return original positions of the first and last valid values (inclusive)
    return w[[a, a + 1]] + np.array([0, -2])
demo
pir(df.x.values)
array([4, 6])
a = np.array([1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, 8, 7, 5, 2, 5, np.nan, np.nan])
pir(a)
array([ 7, 11])