I am using Python with numpy.
I have a numpy array of index ranges, a:
>>> a
array([[ 5,  7],
       [12, 18],
       [20, 29]])
>>> type(a)
<type 'numpy.ndarray'>
I have a second numpy array of index ranges, b:
>>> b
array([[ 2,  4],
       [ 8, 11],
       [33, 35]])
>>> type(b)
<type 'numpy.ndarray'>
I need to merge array a with array b:
a + b => [2, 4] [5, 7] [8, 11] [12, 18] [20, 29] [33, 35]
=> merged index ranges of a and b => [2, 18] [20, 29] [33, 35]
(the ranges [2, 4] [5, 7] [8, 11] [12, 18] follow one another without a gap
=> 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18 => [2, 18])
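One way to get this result is to sort all ranges by their start and begin a new merged range only where a start lies more than 1 past every end seen so far. The sketch below is only an illustration: the function name `merge_ranges` is made up here, the "gap of 1 counts as consecutive" rule is inferred from the example above, and the input is assumed to contain at least one `[start, end]` row of integers.

```python
import numpy as np

a = np.array([[5, 7], [12, 18], [20, 29]])
b = np.array([[2, 4], [8, 11], [33, 35]])


def merge_ranges(*arrays):
    """Merge inclusive integer [start, end] rows that overlap or touch (hypothetical helper)."""
    ranges = np.concatenate(arrays)
    # Sort rows by start value.
    ranges = ranges[np.argsort(ranges[:, 0], kind="stable")]
    starts, ends = ranges[:, 0], ranges[:, 1]
    # Largest end seen so far, so a long range absorbs shorter ones inside it.
    running_end = np.maximum.accumulate(ends)
    # Assumption: [2, 4] and [5, 7] are consecutive, so a new merged range
    # starts only when the gap to the previous ends is larger than 1.
    new_group = np.ones(len(ranges), dtype=bool)
    new_group[1:] = starts[1:] > running_end[:-1] + 1
    # Each merged range ends at the running maximum of its last row.
    last_rows = np.append(np.flatnonzero(new_group)[1:] - 1, len(ranges) - 1)
    return np.column_stack((starts[new_group], running_end[last_rows]))


merged = merge_ranges(a, b)
print(merged)
```

For the arrays a and b above this yields the rows [2, 18], [20, 29], [33, 35]. Everything is done with whole-array operations, so there is no Python-level loop over the rows.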
For this example:
>>> …

Which one is faster? Is one of them "better"? Basically I will have two sets, and in the end I want to get a single match between the two. So I figured a for loop would look more like this:
for object in set:
    if object in other_set:
        return object
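As I said, I only need one match, but I am not sure how intersection() works internally, so I don't know whether it is any better. Also, in case it helps: other_set is a list of nearly 100,000 items, and set will probably hold a few hundred items, a few thousand at most.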
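For what it's worth, here is a small sketch of the early-exit idea. The names `first_match`, `small` and `large` are invented for illustration. The main point is that `in` on a list scans the list element by element, while `in` on a set is a hash lookup, so the large collection is converted to a set once before the loop.

```python
def first_match(candidates, pool):
    """Return one element of candidates that is also in pool, or None (hypothetical helper)."""
    # Build the set once; repeated `in` checks against a list would rescan it each time.
    pool_set = pool if isinstance(pool, (set, frozenset)) else set(pool)
    # The generator stops at the first hit, so later candidates are never checked.
    return next((item for item in candidates if item in pool_set), None)


small = {3, 250_000, 7}          # stands in for the set with a few hundred items
large = list(range(100_000))     # stands in for the ~100,000-item list

match = first_match(small, large)
all_matches = small & set(large)  # intersection() builds every match, not just one
print(match, all_matches)
```

Which element comes back depends on the iteration order of the small set, so treat it as "some match", not "the first one". If only a yes/no answer is needed, `small.isdisjoint(large)` avoids building a result at all. Actual timings depend on the data, so measuring with `timeit` on real inputs is the safer way to decide.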
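I am trying to find a more efficient way to detect overlapping date ranges in a dataframe (each row provides a start and an end date), grouped by a specific column (id).
The dataframe is sorted on the "from" column.
I think there must be a way to avoid the "double" apply that I am doing now...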
import pandas as pd
from datetime import datetime
df = pd.DataFrame(columns=['id','from','to'], index=range(5),
                  data=[[878,'2006-01-01','2007-10-01'],
                        [878,'2007-10-02','2008-12-01'],
                        [878,'2008-12-02','2010-04-03'],
                        [879,'2010-04-04','2199-05-11'],
                        [879,'2016-05-12','2199-12-31']])
df['from'] = pd.to_datetime(df['from'])
df['to'] = pd.to_datetime(df['to'])
id from to
0 878 2006-01-01 2007-10-01
1 878 2007-10-02 2008-12-01
2 878 2008-12-02 2010-04-03
3 879 2010-04-04 2199-05-11
4 879 2016-05-12 2199-12-31
I use "apply" to loop over all the groups, and inside each group I use "apply" again on every row:
def check_date_by_id(df):
    df['prevFrom'] = df['from'].shift()
    df['prevTo'] = df['to'].shift()
    def check_date_by_row(x):
        if pd.isnull(x.prevFrom) or pd.isnull(x.prevTo):
            x['overlap'] = False
            return x
        latest_start = max(x['from'], x.prevFrom)
        earliest_end = min(x['to'], x.prevTo)
        x['overlap'] = …
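A possible way to avoid both apply calls is to shift within each id using groupby and then compare whole columns at once. This is only a sketch: the snippet above is cut off before the overlap test, so the rule used here (a row overlaps the previous row of the same id when latest_start <= earliest_end) is an assumption, and `prev_from` / `prev_to` are names chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [878, 878, 878, 879, 879],
    "from": pd.to_datetime(["2006-01-01", "2007-10-02", "2008-12-02",
                            "2010-04-04", "2016-05-12"]),
    "to": pd.to_datetime(["2007-10-01", "2008-12-01", "2010-04-03",
                          "2199-05-11", "2199-12-31"]),
})

# Previous row's range within the same id; NaT for the first row of each id.
prev_from = df.groupby("id")["from"].shift()
prev_to = df.groupby("id")["to"].shift()

# Element-wise max of the starts and min of the ends.
# Comparisons against NaT are False, so first rows end up with NaT here.
latest_start = df["from"].where(df["from"] >= prev_from, prev_from)
earliest_end = df["to"].where(df["to"] <= prev_to, prev_to)

# Assumed overlap rule (the original test is truncated): ranges overlap when
# the later start is not after the earlier end. NaT comparisons give False.
df["overlap"] = latest_start <= earliest_end
print(df)
```

With the sample data only the last row (id 879, starting 2016-05-12) is flagged, because it begins before the previous 879 range ends on 2199-05-11. If touching ranges should not count as overlapping, the final comparison would be `<` instead of `<=`.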