熊猫：当一列达到另一列的某个值时，如何返回行值？

Question

熊猫：当一列达到另一列的某个值时，如何返回行值？

bbe*_*t36 14 python performance time python-3.x pandas

这是数据示例：

目标：
为running_bid_max 大于或等于中的值创建一个新的时间戳列ask_price_target_good。然后，创建一个单独的时间戳列，用于当running_bid_min是小于或等于 ask_price_target_bad。

注意：这将对大量数据执行，并且需要尽快计算出需求。我希望我不必通过遍历所有行iterrows()

running_bid_min和running_bid_max使用计算出的running.min()，并pd.running.max()在将来某个时间帧（本实施例中是使用5分钟的时间线。因此，这将是运行分钟，从当前时间最多5分钟）

复制下面的数据，然后使用 df = pd.read_clipboard(sep=',')

   time,bid_price,ask_price,running_bid_max,running_bid_min,ask_price_target_good,ask_price_target_bad
2019-07-24 07:59:44.432034,291.06,291.26,291.4,291.09,291.46,291.06
2019-07-24 07:59:46.393418,291.1,291.33,291.4,291.09,291.53,291.13
2019-07-24 07:59:48.425615,291.1,291.33,291.4,291.09,291.53,291.13
2019-07-24 07:59:50.084206,291.12,291.33,291.4,291.09,291.53,291.13
2019-07-24 07:59:52.326455,291.12,291.33,291.4,291.09,291.53,291.13
2019-07-24 07:59:54.428181,291.12,291.33,291.4,291.09,291.53,291.13
2019-07-24 07:59:58.550378,291.14,291.35,291.4,291.2,291.55,291.15
2019-07-24 08:00:00.837238,291.2,291.35,291.4,291.2,291.55,291.15
2019-07-24 08:00:57.338769,291.4,291.46,291.51,291.4,291.66,291.26
2019-07-24 08:00:59.058198,291.4,291.46,291.96,291.4,291.66,291.26
2019-07-24 08:01:00.802679,291.4,291.46,291.96,291.4,291.66,291.26
2019-07-24 08:01:02.781289,291.4,291.46,291.96,291.45,291.66,291.26
2019-07-24 08:01:04.645144,291.45,291.46,291.96,291.45,291.66,291.26
2019-07-24 08:01:06.491997,291.45,291.46,292.07,291.45,291.66,291.26
2019-07-24 08:01:08.586688,291.45,291.46,292.1,291.45,291.66,291.26

Run Code Online (Sandbox Code Playgroud)

Answer 1

Qua*_*ang 11

根据您的问题：

为running_bid_max大于或等于中的值创建一个新的timestamp列ask_price_target_good。然后为when running_bid_min小于或等于创建一个单独的timestamp列ask_price_target_bad

这个问题似乎微不足道：

df['g'] = np.where(df.running_bid_max.ge(df.ask_price_target_good), df['time'], pd.NaT)

df['l'] = np.where(df.running_bid_min.le(df.ask_price_target_bad), df['time'], pd.NaT)

Run Code Online (Sandbox Code Playgroud)

还是我错过了什么？

更新：你可能要ffill和bfill上面的命令后：

df['g'] = df['g'].bfill()
df['l'] = df['l'].ffill()

Run Code Online (Sandbox Code Playgroud)

输出，例如df['g']：

0    2019-07-24 08:00:59.058198
1    2019-07-24 08:00:59.058198
2    2019-07-24 08:00:59.058198
3    2019-07-24 08:00:59.058198
4    2019-07-24 08:00:59.058198
5    2019-07-24 08:00:59.058198
6    2019-07-24 08:00:59.058198
7    2019-07-24 08:00:59.058198
8    2019-07-24 08:00:59.058198
9    2019-07-24 08:00:59.058198
10   2019-07-24 08:01:00.802679
11   2019-07-24 08:01:02.781289
12   2019-07-24 08:01:04.645144
13   2019-07-24 08:01:06.491997
14   2019-07-24 08:01:08.586688

Run Code Online (Sandbox Code Playgroud)

Answer 2

Dre*_*rey 5

It would be very nice if you could print the desired output. Otherwise I may miss the logic.

If you are working on large amount of data, it makes sense to apply steaming analytics*. (This will quite memory efficient and if you use cytoolz even 2-4 times faster)

So basically you would like to partition your data based on either one or the other condition:

partitions = toolz.partitionby(lambda x: (x['running_bid_max'] >= x['ask_price_target_good']) or
                                         (x['running_bid_min'] <= x['ask_price_target_bad']), data_stream)

Run Code Online (Sandbox Code Playgroud)

Whatever you will do with individual partitions is up to you (you can create addtional fields or columns etc.).

print([(part[0]['time'], part[-1]['time'], 
        part[0]['running_bid_max'] > part[0]['ask_price_target_good'],
        part[0]['running_bid_min'] > part[0]['ask_price_target_bad']) 
       for part in partitions])

Run Code Online (Sandbox Code Playgroud)

[('2019-07-24T07:59:46.393418', '2019-07-24T07:59:46.393418', False, False), 
 ('2019-07-24T07:59:44.432034', '2019-07-24T07:59:44.432034', False,  True), 
 ('2019-07-24T07:59:48.425615', '2019-07-24T07:59:54.428181', False, False), 
 ('2019-07-24T07:59:58.550378', '2019-07-24T08:00:57.338769', False,  True), 
 ('2019-07-24T08:00:59.058198', '2019-07-24T08:01:08.586688',  True,  True)]

Run Code Online (Sandbox Code Playgroud)

Also note that it is easy to create individual DataFrames

info_cols = ['running_bid_max', 'ask_price_target_good', 'running_bid_min', 'ask_price_target_bad', 'time'] 
data_frames = [pandas.DataFrame(_)[info_cols] for _ in partitions]
data_frames

Run Code Online (Sandbox Code Playgroud)

   running_bid_max  ask_price_target_good  running_bid_min  ask_price_target_bad                        time
0            291.4                 291.53           291.09                291.13  2019-07-24T07:59:46.393418

   running_bid_max  ask_price_target_good  running_bid_min  ask_price_target_bad                        time
0            291.4                 291.46           291.09                291.06  2019-07-24T07:59:44.432034

   running_bid_max  ask_price_target_good  running_bid_min  ask_price_target_bad                        time
0            291.4                 291.53           291.09                291.13  2019-07-24T07:59:48.425615
1            291.4                 291.53           291.09                291.13  2019-07-24T07:59:50.084206
2            291.4                 291.53           291.09                291.13  2019-07-24T07:59:52.326455
3            291.4                 291.53           291.09                291.13  2019-07-24T07:59:54.428181

   running_bid_max  ask_price_target_good  running_bid_min  ask_price_target_bad                        time
0           291.40                 291.55            291.2                291.15  2019-07-24T07:59:58.550378
1           291.40                 291.55            291.2                291.15  2019-07-24T08:00:00.837238
2           291.51                 291.66            291.4                291.26  2019-07-24T08:00:57.338769

   running_bid_max  ask_price_target_good  running_bid_min  ask_price_target_bad                        time
0           291.96                 291.66           291.40                291.26  2019-07-24T08:00:59.058198
1           291.96                 291.66           291.40                291.26  2019-07-24T08:01:00.802679
2           291.96                 291.66           291.45                291.26  2019-07-24T08:01:02.781289
3           291.96                 291.66           291.45                291.26  2019-07-24T08:01:04.645144
4           292.07                 291.66           291.45                291.26  2019-07-24T08:01:06.491997
5           292.10                 291.66           291.45                291.26  2019-07-24T08:01:08.586688

Run Code Online (Sandbox Code Playgroud)

Unfortunatly I couldn't find a one liner pytition_by for DataFrame. It surely is hidden somewhere. (But again, pandas usually loads all data into memory - if you want to aggregate during I/O than streaming could be a way to go.)

*Streaming Example

For example, lets us create a simple csv stream:

[('2019-07-24T07:59:46.393418', '2019-07-24T07:59:46.393418', False, False), 
 ('2019-07-24T07:59:44.432034', '2019-07-24T07:59:44.432034', False,  True), 
 ('2019-07-24T07:59:48.425615', '2019-07-24T07:59:54.428181', False, False), 
 ('2019-07-24T07:59:58.550378', '2019-07-24T08:00:57.338769', False,  True), 
 ('2019-07-24T08:00:59.058198', '2019-07-24T08:01:08.586688',  True,  True)]

Run Code Online (Sandbox Code Playgroud)

that yields one processed row at a time:

next(data_stream())

{'time': '2019-07-24T07:59:46.393418',
 'bid_price': 291.1,
 'ask_price': 291.33,
 'running_bid_max': 291.4,
 'running_bid_min': 291.09,
 'ask_price_target_good': 291.53,
 'ask_price_target_bad': 291.13}

Run Code Online (Sandbox Code Playgroud)

归档时间：	6 年，5 月前
查看次数：	611 次
最近记录：	6 年，4 月前