Pythonic and ufunc-y way to turn a pandas column into an "increasing" index?

Question

Suppose I have a pandas df like this:

Index   A     B
0      foo    3
1      foo    2
2      foo    5
3      bar    3
4      bar    4
5      baz    5

What is a fast way to add a column like the following:

Index   A     B    Aidx
0      foo    3    0
1      foo    2    0
2      foo    5    0
3      bar    3    1
4      bar    4    1
5      baz    5    2

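That is, add an increasing index for each unique value?

I know I could use df.A.unique(), then build a lookup with dict and enumerate, and then apply that dictionary lookup to create the column. But I feel there should be a faster way, possibly involving groupby with some special function?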
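
For reference, the lookup approach described above might look like the following sketch. The DataFrame is rebuilt from the example table, and the name `lookup` is only illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "bar", "baz"],
    "B": [3, 2, 5, 3, 4, 5],
})

# unique() keeps values in order of first appearance,
# so enumerate() hands out 0, 1, 2, ... in that same order.
lookup = {value: i for i, value in enumerate(df["A"].unique())}

# Apply the dictionary lookup to create the new column.
df["Aidx"] = df["A"].map(lookup)
print(df)
```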

Answer 1

sac*_*cuL 7

One way is to use ngroup. Just remember that groupby sorts the group keys by default, so to number the groups in order of appearance you have to set sort=False:

df['Aidx'] = df.groupby('A',sort=False).ngroup()
>>> df
   Index    A  B  Aidx
0      0  foo  3     0
1      1  foo  2     0
2      2  foo  5     0
3      3  bar  3     1
4      4  bar  4     1
5      5  baz  5     2
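
To see why sort=False matters here, this small sketch compares both settings on the example data. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "bar", "baz"],
    "B": [3, 2, 5, 3, 4, 5],
})

# Default sort=True numbers the groups in sorted key order: bar=0, baz=1, foo=2.
sorted_ids = df.groupby("A", sort=True).ngroup()

# sort=False numbers the groups in order of first appearance: foo=0, bar=1, baz=2.
appearance_ids = df.groupby("A", sort=False).ngroup()

print(sorted_ids.tolist())      # [2, 2, 2, 0, 0, 1]
print(appearance_ids.tolist())  # [0, 0, 0, 1, 1, 2]
```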

Answer 2

WeN*_*Ben 7

You don't need groupby; here are some alternatives.

Method 1: factorize

pd.factorize(df.A)[0]
array([0, 0, 0, 1, 1, 2], dtype=int64)
#df['Aidx']=pd.factorize(df.A)[0]

Method 2: sklearn

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.A)
LabelEncoder()
le.transform(df.A)
array([2, 2, 2, 0, 0, 1])

Method 3: cat.codes

df.A.astype('category').cat.codes
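
No output is shown for this method, so here is a sketch of what it returns on the example data. By default the categories are sorted, so the codes follow alphabetical order rather than order of appearance. Passing the categories explicitly is one possible way around that (an assumption on my part, not something this answer proposes).

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "bar", "baz"],
    "B": [3, 2, 5, 3, 4, 5],
})

# Default: categories are sorted (bar, baz, foo), so foo gets code 2.
codes_sorted = df["A"].astype("category").cat.codes
print(codes_sorted.tolist())  # [2, 2, 2, 0, 0, 1]

# Giving the categories in order of first appearance yields 0, 1, 2, ...
appearance_cat = pd.Categorical(df["A"], categories=df["A"].unique())
codes_appearance = appearance_cat.codes
print(codes_appearance.tolist())  # [0, 0, 0, 1, 1, 2]
```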

Method 4: map + unique

l=df.A.unique()
df.A.map(dict(zip(l,range(len(l)))))
0    0
1    0
2    0
3    1
4    1
5    2
Name: A, dtype: int64

Method 5: np.unique

x,y=np.unique(df.A.values,return_inverse=True)
y
array([2, 2, 2, 0, 0, 1], dtype=int64)
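
Note that np.unique returns the labels sorted, which is why the codes here are [2, 2, 2, 0, 0, 1] rather than the [0, 0, 0, 1, 1, 2] asked for. If order of appearance is needed, one possible remapping uses return_index. This is a sketch with illustrative variable names, not part of the timed code in this answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "bar", "baz"],
    "B": [3, 2, 5, 3, 4, 5],
})

# uniques is sorted; first_idx holds the first position of each unique value.
uniques, first_idx, inverse = np.unique(
    df["A"].to_numpy(), return_index=True, return_inverse=True
)

# Rank the sorted uniques by where they first appear in the column.
order = np.argsort(first_idx)
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

# Translate the sorted codes into order-of-appearance codes.
appearance_codes = rank[inverse]
print(inverse.tolist())           # [2, 2, 2, 0, 0, 1]
print(appearance_codes.tolist())  # [0, 0, 0, 1, 1, 2]
```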

Edit: some timings on the OP's dataframe:

%timeit pd.factorize(view.Company)[0]

The slowest run took 6.68 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 155 µs per loop

%timeit view.Company.astype('category').cat.codes

The slowest run took 4.48 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 449 µs per loop

from itertools import izip

%timeit l = view.Company.unique(); view.Company.map(dict(izip(l,xrange(len(l)))))

1000 loops, best of 3: 666 µs per loop

import numpy as np

%timeit np.unique(view.Company.values, return_inverse=True)

The slowest run took 8.08 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.7 µs per loop
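
The timings above were taken under Python 2 (izip, xrange) on the OP's view.Company column, which is not available here. A rough Python 3 sketch for repeating the comparison outside IPython could look like this. The `company` Series is a made-up stand-in and `map_unique` is a hypothetical helper, so the numbers will differ from those above.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for view.Company: 1000 random labels.
rng = np.random.default_rng(0)
company = pd.Series(rng.choice(["foo", "bar", "baz", "qux"], size=1000))


def map_unique(s):
    """Method 4: build a lookup from unique() and map it over the column."""
    labels = s.unique()
    return s.map(dict(zip(labels, range(len(labels)))))


candidates = {
    "factorize": lambda: pd.factorize(company)[0],
    "cat.codes": lambda: company.astype("category").cat.codes,
    "map+unique": lambda: map_unique(company),
    "np.unique": lambda: np.unique(company.to_numpy(), return_inverse=True),
}

# Best of 3 repeats, 100 calls each, reported as seconds per call.
timings = {
    name: min(timeit.repeat(fn, number=100, repeat=3)) / 100
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>12}: {seconds * 1e6:8.1f} us per call")
```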

It looks like numpy wins.

Answer 3

Rav*_*h13 5

Another way of doing this could be:

df['C'] = df.A.ne(df.A.shift()).cumsum()-1
df

When we print df, it looks as follows.

  Index  A    B  C
0  0     foo  3  0
1  1     foo  2  0 
2  2     foo  5  0 
3  3     bar  3  1 
4  4     bar  4  1 
5  5     baz  5  2

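Explanation of the solution: to make it easier to follow, let's break the solution into parts.

Step 1: Compare column A of df with itself shifted down by one row, as follows.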

df.A.ne(df.A.shift())

The output we get is:

0     True
1    False
2    False
3     True
4    False
5     True

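Step 2: Apply cumsum(). Each True value (a row where column A differs from the row above it) increases the running total by one, and subtracting 1 makes the index start at 0.

df.A.ne(df.A.shift()).cumsum()-1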
0    0
1    0
2    0
3    1
4    1
5    2
Name: A, dtype: int32

Step 3: Assign the result to df['C'], which creates a new column named C in df.
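
One thing worth keeping in mind: the shift/cumsum steps above number consecutive runs of equal values, so they match the per-unique-value index only when equal values sit next to each other, as they do in the example. A small sketch on made-up data where a value reappears later shows the difference:

```python
import pandas as pd

# Hypothetical data: "foo" appears, is interrupted by "bar", then appears again.
df = pd.DataFrame({"A": ["foo", "bar", "foo"]})

# Run-based numbering: every change of value starts a new id.
run_ids = df["A"].ne(df["A"].shift()).cumsum() - 1
print(run_ids.tolist())  # [0, 1, 2]

# Value-based numbering: the second "foo" reuses id 0.
value_ids = pd.factorize(df["A"])[0]
print(value_ids.tolist())  # [0, 1, 0]
```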
