Iterating over pandas more efficiently in Python without a for loop

use*_*006 2 python iteration pandas

I am creating a column that tags certain strings. Here is my code:

import pandas as pd
import numpy as np
import re

data = pd.DataFrame({'Lang': ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})
data['Type'] = ""

# Raw strings keep the regex backslashes intact.
pat = [r"^P\w", r"^S\w"]

# .loc replaces the .ix indexer, which was removed in pandas 1.0.
for i in range(len(data)):
    if re.search(pat[0], data.loc[i, 'Lang']):
        data.loc[i, 'Type'] = "B"

    if re.search(pat[1], data.loc[i, 'Lang']):
        data.loc[i, 'Type'] = "A"

print(data)

Is there a way to get rid of that for loop? numpy has an arange function, and I am looking for something similar to that.

Jef*_*eff 6

This will be faster than the apply solution (and the loop solution).

FYI: this is on pandas 0.13. In 0.12 you would need to create the Type column first.

In [36]: data.loc[data.Lang.str.match(pat[0]),'Type'] = 'B'

In [37]: data.loc[data.Lang.str.match(pat[1]),'Type'] = 'A'

In [38]: data
Out[38]: 
     Lang Type
0  Python    B
1  Cython  NaN
2   Scipy    A
3   Numpy  NaN
4  Pandas    B

[5 rows x 2 columns]

In [39]: data.fillna('')
Out[39]: 
     Lang Type
0  Python    B
1  Cython     
2   Scipy    A
3   Numpy     
4  Pandas    B

[5 rows x 2 columns]
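For readers who want to run this outside an IPython session, here is a minimal standalone sketch of the same `.loc` plus `str.match` idea. It assumes a reasonably recent pandas and creates the `Type` column up front (as was needed on 0.12), so unmatched rows stay as empty strings and no `fillna` step is needed.

```python
import pandas as pd

data = pd.DataFrame({"Lang": ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})
data["Type"] = ""  # create the column first so unmatched rows stay ""

pat = [r"^P\w", r"^S\w"]

# str.match returns a boolean Series; .loc assigns only where it is True.
data.loc[data["Lang"].str.match(pat[0]), "Type"] = "B"
data.loc[data["Lang"].str.match(pat[1]), "Type"] = "A"

print(data)
```

Each assignment handles the whole column in one call, which is what removes the explicit Python loop.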

Here are some timings:

In [34]: bigdata = pd.concat([data]*2000,ignore_index=True)

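In [35]: def f3(df):
    df = df.copy()
    df['Type'] = ''
    # Note: .ix was removed in pandas 1.0; on current versions use df.loc[i, 'Lang'] and df.loc[i, 'Type'].
    for i in range(len(df.Lang)):
        if re.search(pat[0],df.Lang.ix[i]):
            df.Type.ix[i] = 'B'
        if re.search(pat[1],df.Lang.ix[i]):
            df.Type.ix[i] = 'A'
    return df
   ....:             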

In [36]: def f2(df):
    df = df.copy()
    df.loc[df.Lang.str.match(pat[0]),'Type'] = 'B'
    df.loc[df.Lang.str.match(pat[1]),'Type'] = 'A'
    # fillna returns a new frame, so return its result instead of discarding it
    return df.fillna('')
   ....:     

In [37]: def f1(df):
    df = df.copy()
    # pat[0] -> 'B' and pat[1] -> 'A', matching f2 and f3
    f = lambda s: re.match(pat[0], s) and 'B' or re.match(pat[1], s) and 'A' or ''
    df['Type'] = df['Lang'].apply(f)
    return df
   ....:     

Your original solution

In [41]: %timeit f3(bigdata)
1 loops, best of 3: 2.21 s per loop

Direct indexing

In [42]: %timeit f2(bigdata)
100 loops, best of 3: 17.3 ms per loop

Apply

In [43]: %timeit f1(bigdata)
10 loops, best of 3: 21.8 ms per loop
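The `%timeit` magic only exists inside IPython. As a rough sketch, a comparable measurement can be made in a plain script with the standard-library `timeit` module. The names `tag_loop` and `tag_vectorized` are made up for this illustration, the loop uses `.loc` because `.ix` no longer exists in current pandas, and the frame is smaller than the 2000x one above so the loop finishes quickly. Absolute numbers will differ from those reported here.

```python
import re
import timeit

import pandas as pd

pat = [r"^P\w", r"^S\w"]
small = pd.DataFrame({"Lang": ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})
bigdata = pd.concat([small] * 200, ignore_index=True)  # 1000 rows


def tag_loop(df):
    """Row-by-row tagging, mirroring the original for loop."""
    df = df.copy()
    df["Type"] = ""
    for i in range(len(df)):
        if re.search(pat[0], df.loc[i, "Lang"]):
            df.loc[i, "Type"] = "B"
        if re.search(pat[1], df.loc[i, "Lang"]):
            df.loc[i, "Type"] = "A"
    return df


def tag_vectorized(df):
    """Whole-column tagging with boolean masks."""
    df = df.copy()
    df["Type"] = ""
    df.loc[df["Lang"].str.match(pat[0]), "Type"] = "B"
    df.loc[df["Lang"].str.match(pat[1]), "Type"] = "A"
    return df


# Both approaches should agree before their speed is compared.
assert tag_loop(bigdata).equals(tag_vectorized(bigdata))

runs = 3
loop_s = timeit.timeit(lambda: tag_loop(bigdata), number=runs)
vec_s = timeit.timeit(lambda: tag_vectorized(bigdata), number=runs)
print(f"loop: {loop_s / runs:.4f} s per call")
print(f"vectorized: {vec_s / runs:.4f} s per call")
```

Checking that both functions return the same frame first guards against comparing the speed of two things that do not compute the same result.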

Here is another, more general method. It is a bit faster and probably more useful, since you can then combine the patterns in a groupby if needed.

In [107]: pats
Out[107]: {'A': '^P\\w', 'B': '^S\\w'}

In [108]: concat([df,DataFrame(dict([ (c,Series(c,index=df.index)[df.Lang.str.match(p)].reindex(df.index)) for c,p in pats.items() ]))],axis=1)
Out[108]: 
      Lang    A    B
0   Python    A  NaN
1   Cython  NaN  NaN
2    Scipy  NaN    B
3    Numpy  NaN  NaN
4   Pandas    A  NaN
5   Python    A  NaN
6   Cython  NaN  NaN

45  Python    A  NaN
46  Cython  NaN  NaN
47   Scipy  NaN    B
48   Numpy  NaN  NaN
49  Pandas    A  NaN
50  Python    A  NaN
51  Cython  NaN  NaN
52   Scipy  NaN    B
53   Numpy  NaN  NaN
54  Pandas    A  NaN
55  Python    A  NaN
56  Cython  NaN  NaN
57   Scipy  NaN    B
58   Numpy  NaN  NaN
59  Pandas    A  NaN
       ...  ...  ...

[10000 rows x 3 columns]

In [106]: %timeit  concat([df,DataFrame(dict([ (c,Series(c,index=df.index)[df.Lang.str.match(p)].reindex(df.index)) for c,p in pats.items() ]))],axis=1)
100 loops, best of 3: 15.5 ms per loop

For each pattern, this tacks a Series onto the frame that holds the letter in the matching positions (and NaN everywhere else).

Create a Series filled with that letter

Series(c,index=df.index)

Select the matches out of it

Series(c,index=df.index)[df.Lang.str.match(p)]

Reindexing puts NaN wherever a label is missing from the selected Series

Series(c,index=df.index)[df.Lang.str.match(p)].reindex(df.index)
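The three steps above can be sketched as a small standalone script. It uses the five-row frame from the question rather than the 10000-row one, writes the `pd.` prefixes out explicitly, and the intermediate names `letters`, `matched`, `column` and `tagged` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"Lang": ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})
pats = {"A": r"^P\w", "B": r"^S\w"}

c, p = "A", pats["A"]

# Step 1: a Series holding the letter in every row.
letters = pd.Series(c, index=df.index)

# Step 2: keep only the rows whose Lang matches the pattern.
matched = letters[df["Lang"].str.match(p)]

# Step 3: reindex to the full index; the rows dropped in step 2 become NaN.
column = matched.reindex(df.index)

# The same three steps for every pattern, attached to the original frame.
per_pattern = {
    c: pd.Series(c, index=df.index)[df["Lang"].str.match(p)].reindex(df.index)
    for c, p in pats.items()
}
tagged = pd.concat([df, pd.DataFrame(per_pattern)], axis=1)
print(tagged)
```

Because every pattern gets its own column, a row that matches several patterns keeps all of its labels instead of having later assignments overwrite earlier ones.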