使用python,选择长于N的重复元素

Question

使用python,选择长于N的重复元素

Jon*_*eco 8 python conditional numpy pandas

假设我有一个如下数据帧:

df = pd.DataFrame({'A':[1,1,2,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5,6,6]})

df
Out[1]: 
        A
    0   1
    1   1
    2   2
    3   3
    4   3
    5   3
    6   3
    7   3
    8   4
    9   4
    10  4
    11  4
    12  4
    13  4
    14  4
    15  5
    16  5
    17  5
    18  5
    19  6
    20  6

Run Code Online (Sandbox Code Playgroud)

我试图过滤重复4次或更多次的数字,输出结果如下:

Run Code Online (Sandbox Code Playgroud)

现在我正在使用itemfreq提取该信息,这导致一系列数组,然后复杂的条件和过滤这些数字.我认为必须有其他最简单的方法来做到这一点.一些想法？谢谢!

Answer 1

ayh*_*han 7

groupby.filter可能是最简单的方法:

df.groupby('A').filter(lambda x: x.size > 3)
Out: 
    A
3   3
4   3
5   3
6   3
7   3
8   4
9   4
10  4
11  4
12  4
13  4
14  4
15  5
16  5
17  5
18  5

Run Code Online (Sandbox Code Playgroud)

Answer 2

Div*_*kar 5

方法#1:其中一种NumPy方式是 -

a = df.A.values
unq, c = np.unique(a, return_counts=1)
df_out = df[np.in1d(a,unq[c>=4])]

Run Code Online (Sandbox Code Playgroud)

方法#2: NumPy + Pandas混合单线作为正数 -

df[df.A.isin(np.flatnonzero(np.bincount(df.A)>=4))]

Run Code Online (Sandbox Code Playgroud)

方法#3:使用数据框在相关列上排序的事实,这是一个更深入的 NumPy方法 -

def filter_df(df, N=4):
    a = df.A.values
    mask = np.concatenate(( [True], a[1:] != a[:-1], [True] ))
    idx = np.flatnonzero(mask)
    count = idx[1:] - idx[:-1]

    valid_mask = count>=N
    good_idx = idx[:-1][valid_mask]
    out_arr = np.repeat(a[good_idx], count[valid_mask])
    out_df = pd.DataFrame(out_arr)
    return out_df

Run Code Online (Sandbox Code Playgroud)

标杆

@piRSquared已覆盖了广泛的基准测试为迄今为止发布的所有方法,并因为它似乎pir1_5和div3似乎是top-2在那里.但时间似乎相当,它促使我仔细观察.在该基准测试中,我们timeit(stmt, setp, number=10)使用了恒定的迭代次数来运行timeit,这不是最可靠的计时方法,特别是对于小型数据集.此外,数据集似乎很小,因为最大数据集的时间安排在micro-sec.因此,为了缓解这两个问题 - 我建议使用IPython %timeit来自动计算运行时间的最佳迭代次数,即较小数据集的迭代次数比较大数据集的迭代次数要多.这应该更可靠.此外,我建议包括更大的数据集,以便时间进入milli-sec和sec.因此,通过这些改变,新的基准测试设置看起来像这样(记得复制并粘贴到IPython控制台上运行) -

sizes =[10, 30, 100, 300, 1000, 3000, 10000, 100000, 1000000, 10000000]
timings = np.zeros((len(sizes),2))
for i,s in enumerate(sizes):
    diffs = np.random.randint(100, size=s)
    d = pd.DataFrame(dict(A=np.arange(s).repeat(diffs)))
    res = %timeit -oq div3(d)
    timings[i,0] = res.best
    res = %timeit -oq pir1_5(d)
    timings[i,1] = res.best
timings_df = pd.DataFrame(timings, columns=(('div3(sec)','pir1_5(sec)')))
timings_df.index = sizes
timings_df.index.name = 'Datasizes'

Run Code Online (Sandbox Code Playgroud)

为了完整起见,方法是 -

def pir1_5(d):
    v = d.A.values
    t = np.flatnonzero(v[1:] != v[:-1])
    s = np.empty(t.size + 2, int)
    s[0] = -1
    s[-1] = v.size - 1
    s[1:-1] = t
    r = np.diff(s)
    return pd.DataFrame(v[(r > 3).repeat(r)])

def div3(df, N=4):
    a = df.A.values
    mask = np.concatenate(( [True], a[1:] != a[:-1], [True] ))
    idx = np.flatnonzero(mask)
    count = idx[1:] - idx[:-1]

    valid_mask = count>=N
    good_idx = idx[:-1][valid_mask]
    out_arr = np.repeat(a[good_idx], count[valid_mask])
    return pd.DataFrame(out_arr)

Run Code Online (Sandbox Code Playgroud)

时序设置在IPython控制台上运行(因为我们使用的是魔法功能).结果看起来像这样 -

In [265]: timings_df
Out[265]: 
           div3(sec)  pir1_5(sec)
Datasizes                        
10          0.000090     0.000089
30          0.000096     0.000097
100         0.000109     0.000118
300         0.000157     0.000182
1000        0.000303     0.000396
3000        0.000713     0.000998
10000       0.002252     0.003701
100000      0.023257     0.036480
1000000     0.258133     0.398812
10000000    2.603467     3.759063

Run Code Online (Sandbox Code Playgroud)

因此,加速的数字div3超过pir1_5是:

In [266]: timings_df.iloc[:,1]/timings_df.iloc[:,0]
Out[266]: 
Datasizes
10          0.997704
30          1.016446
100         1.077129
300         1.163333
1000        1.304689
3000        1.400464
10000       1.643474
100000      1.568554
1000000     1.544987
10000000    1.443868

Run Code Online (Sandbox Code Playgroud)

Answer 3

piR*_*red 5

numpy
在此解决方案中,我跟踪值不等于后续值的位置.然后取np.diff这些位置给出了每个子序列的长度.
然后我可以测试长度是否> 3和使用np.repeat相同的长度列表来获得切片所需的布尔数组.

def pir(d):
    v = d.A.values
    t = np.flatnonzero(v[1:] != v[:-1])
    s = np.empty(t.size + 2, int)
    s[0] = -1
    s[-1] = v.size - 1
    s[1:-1] = t
    r = np.diff(s)
    return d[(r > 3).repeat(r)]

pir(df)

    A
3   3
4   3
5   3
6   3
7   3
8   4
9   4
10  4
11  4
12  4
13  4
14  4
15  5
16  5
17  5
18  5

Run Code Online (Sandbox Code Playgroud)

numba
使用numbas JIT编译器运行线性算法

from numba import njit

@njit
def bslc(a, o, p):
    ref = a[0]
    count = 1
    for i, x in enumerate(a[1:], 1):
        if x != ref:
            if count <= p:
                o[i-count:i] = False
            count = 1
            ref = x
        else:
            count += 1
    if count <= p:
        o[-count:] = False

    return o

def pir2(d):
    v = d.A.values
    return d[bslc(v, np.ones(v.size, bool), 3)]

Run Code Online (Sandbox Code Playgroud)

定时

res.div(res.min(1), 0).T

                 10         30         100        300        1000        3000        10000
pir1          2.534343   2.139391   2.270708   2.252734   2.163466    2.197770    2.356690
pir1_5        1.000000   1.000000   1.000000   1.000000   1.003564    1.000000    1.000000
pir2          2.296362   2.042509   2.102017   1.986853   2.001767    2.051514    2.275013
div1          2.982864   2.491685   2.737808   2.448306   2.941023    3.038580    3.281304
div2          3.801531   3.416633   3.383586   3.054023   3.256094    3.913530    3.632997
div3          1.319752   1.009515   1.041559   1.031727   1.000000    1.066009    1.243461
ayhan        12.511852  12.910074  21.728097  21.594083  36.871183   63.998537  101.748270
wen          12.978138  15.125419  15.116332  12.846738  16.342696   16.241073   16.043509
roman        18.241104  21.198169  32.914250  37.305237  70.204290  106.672777  178.032963
roman_v_pir  10.178496  11.122943  12.522402   8.874233  11.489024   12.244398   11.689981

Run Code Online (Sandbox Code Playgroud)

码

功能

def pir1(d):
    v = d.A.values
    t = np.flatnonzero(v[1:] != v[:-1])
    s = np.empty(t.size + 2, int)
    s[0] = -1
    s[-1] = v.size - 1
    s[1:-1] = t
    r = np.diff(s)
    return d[(r > 3).repeat(r)]

def pir1_5(d):
    v = d.A.values
    t = np.flatnonzero(v[1:] != v[:-1])
    s = np.empty(t.size + 2, int)
    s[0] = -1
    s[-1] = v.size - 1
    s[1:-1] = t
    r = np.diff(s)
    return pd.DataFrame(v[(r > 3).repeat(r)])

@njit
def bslc(a, o, p):
    ref = a[0]
    count = 1
    for i, x in enumerate(a[1:], 1):
        if x != ref:
            if count <= p:
                o[i-count:i] = False
            count = 1
            ref = x
        else:
            count += 1
    if count <= p:
        o[-count:] = False

    return o

def pir2(d):
    v = d.A.values
    return d[bslc(v, np.ones(v.size, bool), 3)]

def div1(d):
    a = d.A.values
    unq, c = np.unique(a, return_counts=1)
    return d[np.in1d(a, unq[c >= 4])]

def div2(df):
    return df[df.A.isin(np.flatnonzero(np.bincount(df.A) >= 4))]

def div3(df, N=4):
    a = df.A.values
    mask = np.concatenate(( [True], a[1:] != a[:-1], [True] ))
    idx = np.flatnonzero(mask)
    count = idx[1:] - idx[:-1]

    valid_mask = count>=N
    good_idx = idx[:-1][valid_mask]
    out_arr = np.repeat(a[good_idx], count[valid_mask])
    return pd.DataFrame(out_arr)


ayhan = lambda df: df.groupby('A').filter(lambda x: x.size > 3)
wen = lambda df: df[df.A.isin(df.A.value_counts()[df.A.value_counts()>=4].index)]

roman = lambda df: df[df.groupby('A').A.transform(len)>3]
roman_v_pir = lambda df: df[df.groupby('A').A.transform('size')>3]

Run Code Online (Sandbox Code Playgroud)

测试

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000],
    columns='pir1 pir1_5 pir2 div1 div2 div3 ayhan wen roman roman_v_pir'.split(),
    dtype=float
)

for i in res.index:
    n = int(i ** .5) 
    diffs = np.random.randint(100, size=n)
    d = pd.DataFrame(dict(A=np.arange(n).repeat(diffs)))
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=10)

Run Code Online (Sandbox Code Playgroud)

归档时间：	8 年，5 月前
查看次数：	153 次
最近记录：	8 年，5 月前