How to select the first occurrence of each run of repeated items in a list, in order, using plain Python or pandas/numpy/scipy

RTM*_*RTM 3 python numpy scipy pandas pandas-groupby

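Suppose there is a list 'series' that contains repeated elements at several index positions. Is there a way to find the first occurrence of each run of repeated numbers?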

series = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]

The return value should look like [5,11,17,21], which are the indices of the first occurrence of the repeated runs [16,16], [16,16,16], [17,17] and [18,18].

Div*_*kar 5

Here's one that uses array slicing for performance, similar to @piRSquared's second solution but without any appending/concatenation:

a = np.array(series)
out = np.flatnonzero((a[2:] == a[1:-1]) & (a[1:-1] != a[:-2]))+1
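To make the slicing expression easier to follow, here is a minimal sketch that names each intermediate mask. The variable names `mid`, `same_as_next`, `differs_from_prev` and `run_starts` are illustrative and not part of the answer above. Position i starts a run when a[i] equals a[i+1] and differs from a[i-1].

```python
import numpy as np

series = [2, 3, 7, 10, 11, 16, 16, 9, 11, 12, 14, 16, 16, 16, 5, 7, 9, 17, 17, 4, 8, 18, 18]
a = np.array(series)

mid = a[1:-1]                      # a[i] for i = 1 .. n-2
same_as_next = a[2:] == mid        # True where a[i+1] == a[i]
differs_from_prev = mid != a[:-2]  # True where a[i] != a[i-1]

# Both masks are indexed from position 1 of `a`, hence the +1 to map
# mask positions back to indices into `a`.
run_starts = np.flatnonzero(same_as_next & differs_from_prev) + 1
print(run_starts.tolist())  # [5, 11, 17, 21]
```

The three slices are views of the same array, so no element data is copied before the comparisons.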

Sample run:

In [28]: a = np.array(series)

In [29]: np.flatnonzero((a[2:] == a[1:-1]) & (a[1:-1] != a[:-2]))+1
Out[29]: array([ 5, 11, 17, 21])
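Since the question also asks about plain Python, here is a rough sketch using only the standard library. It is not one of the benchmarked solutions, and the function name `first_indices_of_runs` is made up for illustration. It walks the runs produced by `itertools.groupby` and records the start position of every run longer than one element.

```python
from itertools import groupby

def first_indices_of_runs(values):
    """Return the start index of every run of two or more equal adjacent items."""
    starts = []
    position = 0  # index of the first element of the current run
    for _, group in groupby(values):
        length = sum(1 for _ in group)
        if length > 1:
            starts.append(position)
        position += length
    return starts

series = [2, 3, 7, 10, 11, 16, 16, 9, 11, 12, 14, 16, 16, 16, 5, 7, 9, 17, 17, 4, 8, 18, 18]
print(first_indices_of_runs(series))  # [5, 11, 17, 21]
```

This version is a Python-level loop, so expect it to be slower than the array-based approach on large inputs. It does report a run that begins at index 0.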

Runtime test (for the working solutions)

Approaches:

import numpy as np
import pandas as pd

def piRSquared1(series):
    d = np.flatnonzero(np.diff(series) == 0)
    w = np.append(True, np.diff(d) > 1)
    return d[w].tolist()

def piRSquared2(series):
    s = np.array(series)
    return np.flatnonzero(
        np.append(s[:-1] == s[1:], True) &
        np.append(True, s[1:] != s[:-1])
    ).tolist()

def Zach(series):
    s = pd.Series(series)
    i = [g.index[0] for _, g in s.groupby((s != s.shift()).cumsum()) if len(g) > 1]
    return i

def jezrael(series):
    s = pd.Series(series)
    s1 = s.shift(1).ne(s).cumsum()
    m = ~s1.duplicated() & s1.duplicated(keep=False)
    s2 = m.index[m].tolist()
    return s2

def divakar(series):
    a = np.array(series)
    x = a[1:-1]
    return (np.flatnonzero((a[2:] == x) & (x != a[:-2]))+1).tolist()

For the setup, we simply tile the sample input many times.
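For anyone reproducing this outside IPython, the setup could look roughly like the sketch below. It uses the standard `timeit` module in place of the `%timeit` magic. The helper `make_input` and the repeat counts are assumptions made for illustration, not taken from the original benchmark. It also checks that tiling does not create or break any run at the tile boundaries.

```python
import timeit
import numpy as np

series0 = [2, 3, 7, 10, 11, 16, 16, 9, 11, 12, 14, 16, 16, 16, 5, 7, 9, 17, 17, 4, 8, 18, 18]

def divakar(series):
    a = np.array(series)
    x = a[1:-1]
    return (np.flatnonzero((a[2:] == x) & (x != a[:-2])) + 1).tolist()

def make_input(repeats):
    """Hypothetical helper: tile the sample `repeats` times into one flat list."""
    return np.tile(series0, repeats).tolist()

repeats = 100
series = make_input(repeats)
result = divakar(series)

# Each tile contributes the same four run starts, shifted by the tile length.
expected = [k * len(series0) + i for k in range(repeats) for i in (5, 11, 17, 21)]
assert result == expected

# Best of 3 repeats, 200 calls each, reported per call in microseconds.
best = min(timeit.repeat(lambda: divakar(series), number=200, repeat=3)) / 200
print(f"{best * 1e6:.1f} µs per call")
```

Absolute numbers will differ by machine and library version, so only the relative ordering of the approaches is meaningful.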

Timings:

Case #1: Large set

In [34]: series0 = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]

In [35]: series = np.tile(series0,10000).tolist()

In [36]: %timeit piRSquared1(series)
    ...: %timeit piRSquared2(series)
    ...: %timeit Zach(series)
    ...: %timeit jezrael(series)
    ...: %timeit divakar(series)
    ...: 
100 loops, best of 3: 8.06 ms per loop
100 loops, best of 3: 7.79 ms per loop
1 loop, best of 3: 3.88 s per loop
10 loops, best of 3: 24.3 ms per loop
100 loops, best of 3: 7.97 ms per loop

Case #2: Larger set (on the top 2 solutions)

In [40]: series = np.tile(series0,1000000).tolist()

In [41]: %timeit piRSquared2(series)
1 loop, best of 3: 823 ms per loop

In [42]: %timeit divakar(series)
1 loop, best of 3: 823 ms per loop

Now, these two solutions differ only in how the latter avoids appending. Let's take a closer look at them by running on a smaller dataset:

In [43]: series = np.tile(series0,100).tolist()

In [44]: %timeit piRSquared2(series)
10000 loops, best of 3: 89.4 µs per loop

In [45]: %timeit divakar(series)
10000 loops, best of 3: 82.8 µs per loop

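So this shows that avoiding the concatenation/appending in the latter solution helps on smaller datasets (82.8 µs versus 89.4 µs here), while on larger datasets the two become comparable.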

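A marginal improvement on larger datasets is possible with a single concatenation in place of the +1 offset. Thus, the last step could be rewritten as: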

np.flatnonzero(np.concatenate(([False],(a[2:] == a[1:-1]) & (a[1:-1] != a[:-2]))))
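As a quick sanity check, the sketch below confirms that prepending a single False to the mask gives the same indices as adding 1 afterwards. Both forms only test positions 1 to n-2, so a run that begins at index 0 is never reported. The function `run_starts_including_first` is a hypothetical variant, not from the answer, that reuses the same concatenation idea to cover that case.

```python
import numpy as np

series = [2, 3, 7, 10, 11, 16, 16, 9, 11, 12, 14, 16, 16, 16, 5, 7, 9, 17, 17, 4, 8, 18, 18]
a = np.array(series)

mask = (a[2:] == a[1:-1]) & (a[1:-1] != a[:-2])
with_offset = np.flatnonzero(mask) + 1
with_concat = np.flatnonzero(np.concatenate(([False], mask)))
assert np.array_equal(with_offset, with_concat)

def run_starts_including_first(values):
    """Hypothetical variant that also reports a run starting at index 0."""
    a = np.asarray(values)
    eq = a[1:] == a[:-1]                          # eq[i]: a[i] == a[i+1]
    prev_eq = np.concatenate(([False], eq[:-1]))  # eq[i-1], with False for i = 0
    return np.flatnonzero(eq & ~prev_eq).tolist()

print(run_starts_including_first(series))     # [5, 11, 17, 21]
print(run_starts_including_first([1, 1, 2]))  # [0]
```

On the sample input all three forms agree, because its first two elements differ.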