Pandas - find the longest stretch without NaN values

Jef*_*ist 7 python numpy pandas

I have a pandas DataFrame "df", a sample of which looks like this:

   time  x
0  1     1
1  2     Nan 
2  3     3
3  4     Nan
4  5     8
5  6     7
6  7     5
7  8     Nan

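The real frame is much larger. I am trying to find the longest stretch of non-NaN values in the "x" series and print out the starting and ending index of that stretch. Is this possible?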

Thanks

Div*_*kar 7

Here's a vectorized approach using NumPy tools:

import numpy as np

a = df.x.values  # Extract the relevant column from the DataFrame as an array
m = np.concatenate(([True], np.isnan(a), [True]))  # NaN mask, padded with True at both ends
ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1, 2)  # Start-stop limits of each non-NaN stretch
start, stop = ss[(ss[:, 1] - ss[:, 0]).argmax()]  # Limits of the longest stretch

Sample run:

In [474]: a
Out[474]: 
array([  1.,  nan,   3.,  nan,  nan,  nan,  nan,   8.,   7.,   5.,   2.,
         5.,  nan,  nan])

In [475]: start, stop
Out[475]: (7, 12)

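The intervals are set up so that the difference between each start and stop gives the length of that interval. So, if by "ending index" you mean the index of the last non-NaN element, you need to subtract one from stop.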
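As a minimal sketch of that adjustment, the same steps can be wrapped in a helper that returns an inclusive end position. The name `longest_valid_run` and the `None` return for input with no non-NaN values are assumptions made for illustration, not part of the answer above.

```python
import numpy as np

def longest_valid_run(a):
    """Return (first, last) positions, both inclusive, of the longest NaN-free run.

    Hypothetical helper; returns None when there are no non-NaN values.
    """
    a = np.asarray(a, dtype=float)
    # Pad with True so runs touching either end still get a start and a stop
    m = np.concatenate(([True], np.isnan(a), [True]))
    # Positions where the mask flips: column 0 holds starts, column 1 exclusive stops
    ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1, 2)
    if len(ss) == 0:
        return None
    start, stop = ss[(ss[:, 1] - ss[:, 0]).argmax()]
    # stop is exclusive, so subtract one to get the last non-NaN position
    return int(start), int(stop) - 1

a = np.array([1, np.nan, 3, np.nan, np.nan, np.nan, np.nan,
              8, 7, 5, 2, 5, np.nan, np.nan])
print(longest_valid_run(a))  # (7, 11)
```

If several stretches tie for the longest, `argmax` picks the first one.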


piR*_*red 5

pandas

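import pandas as pd

# named aggregations: output column -> function applied to each group
f = dict(
    Start=pd.Series.first_valid_index,
    Stop=pd.Series.last_valid_index,
    Stretch='count'
)

# each NaN starts a new group; pandas 1.0+ rejects a renaming dict passed
# to Series.groupby(...).agg, so the dict is unpacked as keyword arguments
agged = df.x.groupby(df.x.isnull().cumsum()).agg(**f)
agged.loc[agged.Stretch.idxmax(), ['Start', 'Stop']].values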

array([ 4.,  6.])
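To see why grouping on `df.x.isnull().cumsum()` isolates each stretch, it may help to look at the intermediate values. The sketch below rebuilds the sample frame from the question and prints the group key and the per-group counts; the variable names `key`, `lengths` and `best` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "time": [1, 2, 3, 4, 5, 6, 7, 8],
    "x": [1, np.nan, 3, np.nan, 8, 7, 5, np.nan],
})

# The running count of NaNs increases by one at every NaN, so each NaN
# and the non-NaN values that follow it share one group key
key = df.x.isnull().cumsum()
print(key.tolist())  # [0, 1, 1, 2, 2, 2, 2, 3]

# count() ignores NaN, so it gives the length of the non-NaN stretch per group
lengths = df.x.groupby(key).count()
print(lengths.tolist())  # [1, 1, 3, 0]

# Pick the group with the largest count and drop its leading NaN
best = df.x[key == lengths.idxmax()].dropna()
print(best.index[0], best.index[-1])  # 4 6
```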

numpy

def pir(x):
    # pad both ends with np.nan
    x = np.append(np.nan, np.append(x, np.nan))
    # positions of the nulls in the padded array
    w = np.where(np.isnan(x))[0]
    # diff gives the gap between consecutive nulls;
    # argmax picks the widest gap, i.e. the longest stretch
    a = np.diff(w).argmax()
    # convert the bounding nulls to the first and last
    # positions of the stretch in the original array
    return w[[a, a + 1]] + np.array([0, -2])

Demo

pir(df.x.values)

array([4, 6])

The same function on a longer array:

a = np.array([1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, 8, 7, 5, 2, 5, np.nan, np.nan])
pir(a)

array([ 7, 11])
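Note that `pir` returns positions, both inclusive, rather than index labels. The question asks for the start and end index, so when the frame does not use the default integer index the positions need to be mapped back to labels. A rough sketch of that, using a hypothetical frame with a letter index (the index values are made up for illustration):

```python
import numpy as np
import pandas as pd

def pir(x):
    # same function as above, repeated so this example runs on its own
    x = np.append(np.nan, np.append(x, np.nan))
    w = np.where(np.isnan(x))[0]
    a = np.diff(w).argmax()
    return w[[a, a + 1]] + np.array([0, -2])

# Hypothetical frame: the question's x values with a non-default index
df = pd.DataFrame(
    {"x": [1, np.nan, 3, np.nan, 8, 7, 5, np.nan]},
    index=list("abcdefgh"),
)

first, last = pir(df.x.values)

# Map the positions to index labels
labels = df.index[[first, last]]
print(labels.tolist())  # ['e', 'g']

# Slice the stretch itself; iloc's end is exclusive, hence last + 1
stretch = df.iloc[first:last + 1]
print(stretch.x.tolist())  # [8.0, 7.0, 5.0]
```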