Numpy array conditional matching

Question

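I need to match two very large Numpy arrays (one has 20000 rows, the other about 100000 rows), and I am trying to build a script that does it efficiently. Simply looping over the arrays is extremely slow; can someone suggest a better way? Here is what I am trying to do: array datesSecondDict and array pwfs2Dates contain datetime values. I need to take each datetime value from pwfs2Dates (the smaller array) and check whether datesSecondDict contains a similar datetime value (plus or minus 5 minutes); there may be more than one. If there is one (or more), I populate a new array (the same size as pwfs2Dates) with the value (one of the values) from array valsSecondDict, which is just the array of numerical values corresponding to datesSecondDict. Here is the solution from @unutbu and @joaquin that worked for me (thanks, guys!):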

import time
import datetime as dt
import numpy as np

def combineArs(dict1, dict2):
   """Combine data from 2 dictionaries into a list.
   dict1 contains primary data (e.g. seeing parameter).
   The function compares each timestamp in dict1 to dict2
   to see if there is a matching timestamp record(s)
   in dict2 (plus/minus 5 minutes).
   ==If yes: a list called data gets appended with the
   corresponding parameter value from dict2.
   (Note that if there are more than 1 record matching,
   the first occurring value gets appended to the list).
   ==If no: a list called data gets appended with 0."""
   # Specify the keys to use    
   pwfs2Key = 'pwfs2:dc:seeing'
   dimmKey = 'ws:seeFwhm'

   # Create an iterator for primary dict 
   datesPrimDictIter = iter(dict1[pwfs2Key]['datetimes'])

   # Take the first timestamp value in primary dict
   nextDatePrimDict = next(datesPrimDictIter)

   # Split the second dictionary into lists
   datesSecondDict = dict2[dimmKey]['datetime']
   valsSecondDict  = dict2[dimmKey]['values']

   # Define time window
   fiveMins = dt.timedelta(minutes = 5)
   data = []
   #st = time.time()
   for i, nextDateSecondDict in enumerate(datesSecondDict):
       try:
           while nextDatePrimDict < nextDateSecondDict - fiveMins:
               # If there is no match: append zero and move on
               data.append(0)
               nextDatePrimDict = next(datesPrimDictIter)
           while nextDatePrimDict < nextDateSecondDict + fiveMins:
               # If there is a match: append the value of second dict
               data.append(valsSecondDict[i])
               nextDatePrimDict = next(datesPrimDictIter)
       except StopIteration:
           break
   data = np.array(data)   
   #st = time.time() - st    
   return data

Thanks, Aina.

Answer 1

joa*_*uin 6

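Are the dates in the arrays sorted?

If yes, you can speed up the comparisons by breaking out of the inner loop as soon as its dates are later than the date given by the outer loop. That way you make a single pass over the data instead of looping over the dimVals items len(pwfs2Vals) times.
If no, you could transform the current pwfs2Dates array into, for example, an array of pairs [(date, array_index),...]. You can then sort all your arrays by date to make the one-pass comparison described above, while still having the original indexes needed to set data[i].

For example, if the arrays are already sorted (I use lists here; I am not sure you need arrays for this). (Edited: now using an iterator so that pwfs2Dates is not looped from the beginning on each step):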

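# Python 2 syntax: pdates.next() is spelled next(pdates) in Python 3.
# Assumes data is pre-filled (e.g. with zeros) and both date lists are sorted.
pdates = iter(enumerate(pwfs2Dates))
i, datei = pdates.next()

for datej, valuej in zip(dimmDates, dimVals):
    # Skip pwfs2 dates that fall before the window around datej.
    while datei < datej - fiveMinutes:
        i, datei = pdates.next()
    # Assign valuej to every pwfs2 date inside the window.
    while datei < datej + fiveMinutes:
        data[i] = valuej
        i, datei = pdates.next()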
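
For readers on Python 3, the same one-pass idea can be sketched as a self-contained function. This is only an illustration under assumptions: the function name `match_sorted`, its parameters, and the sample data are hypothetical, both date lists are assumed to be sorted in ascending order, and unmatched slots are left at 0.

```python
import datetime as dt


def match_sorted(prim_dates, sec_dates, sec_vals,
                 window=dt.timedelta(minutes=5)):
    """One-pass match of two ascending date lists (hypothetical helper)."""
    data = [0] * len(prim_dates)          # unmatched slots stay 0
    pdates = iter(enumerate(prim_dates))
    try:
        i, datei = next(pdates)
        for datej, valuej in zip(sec_dates, sec_vals):
            # Skip primary dates that are too early to match datej.
            while datei < datej - window:
                i, datei = next(pdates)
            # Every primary date inside the window takes this value.
            while datei < datej + window:
                data[i] = valuej
                i, datei = next(pdates)
    except StopIteration:
        pass                              # primary dates exhausted
    return data


base = dt.datetime(2011, 1, 1)
minutes = lambda m: base + dt.timedelta(minutes=m)
result = match_sorted([minutes(0), minutes(20), minutes(60)],
                      [minutes(3), minutes(58), minutes(120)],
                      [1.5, 2.5, 3.5])
print(result)  # [1.5, 0, 2.5]
```

Each list is walked only once, so the cost grows with the combined length of the two lists rather than with their product.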

Otherwise, if they are not sorted and you have created sorted, indexed lists like this:

pwfs2Dates = sorted([(date, idx) for idx, date in enumerate(pwfs2Dates)])
dimmDates = sorted([(date, idx) for idx, date in enumerate(dimmDates)])

the code would be:
(Edited: now using an iterator so that pwfs2Dates is not looped from the beginning on each step):

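# Python 2 syntax: pdates.next() is spelled next(pdates) in Python 3.
# pwfs2Dates and dimmDates are the sorted (date, idx) lists built above.
pdates = iter(pwfs2Dates)
datei, i = pdates.next()

for datej, j in dimmDates:
    # Skip pwfs2 dates that fall before the window around datej.
    while datei < datej - fiveMinutes:
        datei, i = pdates.next()
    # i and j are original positions, so data keeps the input order.
    while datei < datej + fiveMinutes:
        data[i] = dimVals[j]
        datei, i = pdates.next()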
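
Putting the sorting step and the loop together, the unsorted case can be sketched as follows in Python 3. The function name `match_unsorted`, its parameters, and the sample data are hypothetical; the point is only to show how the stored indexes map results back to the original order.

```python
import datetime as dt


def match_unsorted(prim_dates, sec_dates, sec_vals,
                   window=dt.timedelta(minutes=5)):
    """Sort (date, index) pairs, then do the one-pass match (hypothetical helper)."""
    data = [0] * len(prim_dates)
    # Remember each date's original position before sorting.
    prim = sorted((d, i) for i, d in enumerate(prim_dates))
    sec = sorted((d, j) for j, d in enumerate(sec_dates))
    pdates = iter(prim)
    try:
        datei, i = next(pdates)
        for datej, j in sec:
            while datei < datej - window:
                datei, i = next(pdates)
            while datei < datej + window:
                data[i] = sec_vals[j]     # write at the original position
                datei, i = next(pdates)
    except StopIteration:
        pass                              # primary dates exhausted
    return data


base = dt.datetime(2011, 1, 1)
minutes = lambda m: base + dt.timedelta(minutes=m)
shuffled = match_unsorted([minutes(60), minutes(0), minutes(20)],
                          [minutes(120), minutes(3), minutes(58)],
                          [3.5, 1.5, 2.5])
print(shuffled)  # [2.5, 1.5, 0]
```

The two sorts add an O(n log n) step in front of the single pass, which is still far cheaper than a nested loop for arrays of this size.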

Great!

..

Note that dimVals:
```
dimVals  = np.array(dict1[dimmKey]['values'])
```
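is not used in your code and can be removed.
Note also that your code becomes much simpler if you loop over the array itself instead of using xrange.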
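
Since the question is about NumPy arrays, the Python-level loop can also be avoided altogether when the dimm dates are sorted. The following is a minimal sketch of a different technique from the iterator merge above: it uses `np.searchsorted` (a binary search) and picks the nearest dimm date rather than the first one in the window, and it treats the 5-minute window as inclusive. The function name `match_nearest`, the parameter names, and the conversion to `datetime64[s]` are assumptions made for illustration.

```python
import numpy as np


def match_nearest(prim_dates, sec_dates, sec_vals,
                  window=np.timedelta64(5, "m")):
    """Vectorized nearest-date match; sec_dates must be sorted (hypothetical helper)."""
    prim_dates = np.asarray(prim_dates, dtype="datetime64[s]")
    sec_dates = np.asarray(sec_dates, dtype="datetime64[s]")
    sec_vals = np.asarray(sec_vals, dtype=float)
    out = np.zeros(prim_dates.shape, dtype=float)   # unmatched slots stay 0
    if sec_dates.size == 0:
        return out
    # Index of the first secondary date >= each primary date.
    right = np.searchsorted(sec_dates, prim_dates)
    left = np.clip(right - 1, 0, sec_dates.size - 1)
    right = np.clip(right, 0, sec_dates.size - 1)
    # Choose whichever neighbour is closer in time.
    d_left = np.abs(prim_dates - sec_dates[left])
    d_right = np.abs(sec_dates[right] - prim_dates)
    nearest = np.where(d_left <= d_right, left, right)
    hit = np.abs(prim_dates - sec_dates[nearest]) <= window
    out[hit] = sec_vals[nearest[hit]]
    return out


matched = match_nearest(
    ["2011-01-01T00:00", "2011-01-01T00:20", "2011-01-01T01:00"],
    ["2011-01-01T00:03", "2011-01-01T00:58", "2011-01-01T02:00"],
    [1.5, 2.5, 3.5],
)
print(matched.tolist())  # [1.5, 0.0, 2.5]
```

Unlike the iterator version, this does not require the pwfs2 dates to be sorted, only the dimm dates that are searched.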

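Edit: unutbu's answer addresses some weak points in the code above. I list them here for completeness:

Use of next: next(iterator) is preferred over iterator.next(). iterator.next() was an exception to the usual naming convention, which was fixed in py3k by renaming the method to iterator.__next__().
Check for the end of the iterator with try/except. Once all items in the iterator have been consumed, the next call to next() raises a StopIteration exception. Use try/except to break out of the loop cleanly when this happens. For the specific case of the OP's question this is not an issue, because the two arrays are the same size, so the for loop finishes at the same time as the iterator and no exception is raised. However, there could be cases where dict1 and dict2 are not the same size, and then an exception can be raised. The question is which is better: using try/except, or preparing the arrays before looping by trimming them to match the shorter one.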
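
Both points can be seen in a few lines of Python. This is a small sketch with a hypothetical helper name (`drain`): it uses the built-in next() (available since Python 2.6) and leaves the loop when StopIteration is raised.

```python
def drain(iterable):
    """Collect items with next() until StopIteration signals the end."""
    it = iter(iterable)
    items = []
    while True:
        try:
            items.append(next(it))   # built-in next(), not it.next()
        except StopIteration:
            break                    # iterator exhausted: leave the loop
    return items


# next() also accepts a default that is returned instead of raising.
exhausted = iter([])
sentinel = next(exhausted, None)
```

The default form of next() is a third option besides try/except and trimming the arrays, provided the sentinel value cannot be confused with real data.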
