如何在Pandas系列中找到与输入数字最接近的值？

Question

如何在Pandas系列中找到与输入数字最接近的值？

Ste*_*eve 30 python ranking dataframe pandas

我见过:

这些与香草蟒蛇有关,而不是熊猫.

如果我有这个系列:

Run Code Online (Sandbox Code Playgroud)

我输入3,我怎样才能(有效地)找到？

如果在系列中找到,则索引为3
如果在系列中找不到,则该值的索引低于和高于3.

IE浏览器.使用上面的系列{1,6,4,5,2}和输入3,我应该得到带有索引(2,4)的值(4,2).

Answer 1

Zer*_*ero 34

你可以使用argsort()喜欢

说, input = 3

In [198]: input = 3

In [199]: df.ix[(df['num']-input).abs().argsort()[:2]]
Out[199]:
   num
2    4
4    2

Run Code Online (Sandbox Code Playgroud)

df_sort 是具有2个最接近值的数据帧.

In [200]: df_sort = df.ix[(df['num']-input).abs().argsort()[:2]]

Run Code Online (Sandbox Code Playgroud)

对于索引,

In [201]: df_sort.index.tolist()
Out[201]: [2, 4]

Run Code Online (Sandbox Code Playgroud)

对于价值观,

In [202]: df_sort['num'].tolist()
Out[202]: [4, 2]

Run Code Online (Sandbox Code Playgroud)

细节,对于上面的解决方案df是

In [197]: df
Out[197]:
   num
0    1
1    6
2    4
3    5
4    2

Run Code Online (Sandbox Code Playgroud)

这给出了错误的答案.我在更复杂的数据集上尝试了这个.您必须使用.iloc而不是.ix并且它运行良好(请参阅@ op1) (3认同)
这是找到上下两个最接近的，还是两个最接近的？ (2认同)
我需要找到a）上面最接近的数字，b）下面最接近的数字。因此，绝对差异不会在所有情况下都实现。 (2认同)

Answer 2

小智 18

我建议iloc除了John Galt的答案之外还使用它,因为即使使用未排序的整数索引,这也会起作用,因为.ix首先查看索引标签

df.iloc[(df['num']-input).abs().argsort()[:2]]

Run Code Online (Sandbox Code Playgroud)

Answer 3

小智 11

如果序列已经排序，则查找索引的有效方法是使用二等分函数。一个例子：

idx = bisect_left(df['num'].values, 3)

Run Code Online (Sandbox Code Playgroud)

让我们考虑col数据帧的列df 已排序。

如果该值val在列中，bisect_left 将返回该值在列表中的精确索引，并将 bisect_right返回下一个位置的索引。
如果值不在列表中，则和bisect_left 都会bisect_right返回相同的索引：插入值以保持列表排序的索引。

val因此，为了回答这个问题，下面的代码给出了in的索引col（如果找到），否则给出最接近值的索引。即使列表中的值不唯一，此解决方案也有效。

idx = bisect_left(df['num'].values, 3)

Run Code Online (Sandbox Code Playgroud)

from bisect import bisect_left, bisect_right

Run Code Online (Sandbox Code Playgroud)

二等分算法可以非常有效地查找数据帧列“col”中特定值“val”的索引或其最接近的邻居，但它需要对列表进行排序。

Answer 4

Ivo*_*ers 7

除了不能完全回答问题之外，这里讨论的其他算法的另一个缺点是它们必须对整个列表进行排序。这导致〜N log（N）的复杂性。

然而，也可以实现同样的结果〜n的。这种方法将数据帧分为两个子集，一个子集小于期望值。在较小的数据帧中，较低的邻居比最大值大，对于较高的邻居，反之亦然。

这给出了以下代码片段：

def find_neighbours(value):
  exactmatch=df[df.num==value]
  if !exactmatch.empty:
      return exactmatch.index[0]
  else:
      lowerneighbour_ind = df[df.num<value].idxmax()
      upperneighbour_ind = df[df.num>value].idxmin()
      return lowerneighbour_ind, upperneighbour_ind

Run Code Online (Sandbox Code Playgroud)

这种方法类似于在熊猫中使用分区，这在处理大型数据集时变得非常有用，而复杂性成为一个问题。

两种策略的比较表明，对于较大的N，分区策略的确更快。对于较小的N，排序策略将以更低的级别实现，因此效率更高。它也是单行代码，可能会提高代码的可读性。

复制此图的代码如下所示：

from matplotlib import pyplot as plt
import pandas
import numpy
import timeit

value=3
sizes=numpy.logspace(2, 5, num=50, dtype=int)

sort_results, partition_results=[],[]
for size in sizes:
    df=pandas.DataFrame({"num":100*numpy.random.random(size)})

    sort_results.append(timeit.Timer("df.iloc[(df['num']-value).abs().argsort()[:2]].index",
                                         globals={'find_neighbours':find_neighbours, 'df':df,'value':value}).autorange())
    partition_results.append(timeit.Timer('find_neighbours(df,value)',
                                          globals={'find_neighbours':find_neighbours, 'df':df,'value':value}).autorange())

sort_time=[time/amount for amount,time in sort_results]
partition_time=[time/amount for amount,time in partition_results]

plt.plot(sizes, sort_time)
plt.plot(sizes, partition_time)
plt.legend(['Sorting','Partitioning'])
plt.title('Comparison of strategies')
plt.xlabel('Size of Dataframe')
plt.ylabel('Time in s')
plt.savefig('speed_comparison.png')

Run Code Online (Sandbox Code Playgroud)

@ClaudiuCreanga 这取决于“df”的大小。我添加了一个图和一些代码来说明这种行为。 (2认同)

归档时间：	10 年，8 月前
查看次数：	33640 次
最近记录：	6 年，3 月前