Python pandas: combine the lists in a specific column to find all_elements

Bos*_*sam 3 python dataframe pandas

Suppose I have a pandas DataFrame with a column of lists, like this:

id  available_fruits  
1   ['apple', 'banana']   
1   []
2   ['apple', 'tomato']
1   ['banana']
2   ['kiwi']

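I want to create a list all_available_fruits without duplicates, which should be ['apple', 'banana', 'kiwi', 'tomato'].

In other words, I want to combine all the elements of the lists in a pandas DataFrame column. How can I do this?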

jez*_*ael 6

Use numpy.concatenate to flatten the lists, then numpy.unique:

a = np.unique(np.concatenate(df['available_fruits'].values.tolist())).tolist()
print(a)

['apple', 'banana', 'kiwi', 'tomato']

Another solution is to flatten with chain.from_iterable, get the unique values with set, and finally convert to a list:

from itertools import chain
a = list(set(chain.from_iterable(df.available_fruits.values.tolist())))
print(a)
['tomato', 'kiwi', 'apple', 'banana']
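
For readers who want to reproduce this end to end, here is a minimal sketch that rebuilds the sample frame from the question and flattens it with chain.from_iterable. A set has no guaranteed order, so the sketch wraps the result in sorted() to get the alphabetical list the question asks for. The names all_fruits and via_explode are illustrative, and the Series.explode variant is not from the answers in this thread; it assumes pandas 0.25 or newer.

```python
from itertools import chain

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 1, 2],
    "available_fruits": [
        ["apple", "banana"],
        [],
        ["apple", "tomato"],
        ["banana"],
        ["kiwi"],
    ],
})

# Flatten the lists, deduplicate with a set, then sort for a stable order.
all_fruits = sorted(set(chain.from_iterable(df["available_fruits"])))
print(all_fruits)

# Assumed alternative (pandas >= 0.25): explode yields one row per element;
# empty lists become NaN, so drop those before taking the unique values.
via_explode = sorted(df["available_fruits"].explode().dropna().unique())
print(via_explode)
```

Both lines should print the same sorted list; drop the sorted() call if the order does not matter.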

Timings:

df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)

In [62]: %timeit list(set(concat(df.available_fruits.values.tolist())))
100 loops, best of 3: 3.16 ms per loop

In [63]: %timeit np.unique(np.concatenate(df['available_fruits'].values.tolist())).tolist()
10 loops, best of 3: 99.2 ms per loop

#John Galt's solution
In [64]: %timeit list(set(df.available_fruits.sum()))
1 loop, best of 3: 4.12 s per loop

#pir's solution 0
In [65]: %timeit list(set(concat(df.available_fruits.values.tolist())))
100 loops, best of 3: 3.16 ms per loop

#pir's solution 1
In [66]: %timeit list({k: 1 for x in df.available_fruits.values.tolist() for k in x})
100 loops, best of 3: 4.59 ms per loop

#pir's solution 2
In [67]: %%timeit
    ...: from sklearn.preprocessing import MultiLabelBinarizer
    ...: 
    ...: mlb = MultiLabelBinarizer()
    ...: mlb.fit(df.available_fruits)
    ...: list(mlb.classes_)
    ...: 
100 loops, best of 3: 4.07 ms per loop

#perigon's solution
In [68]: %timeit list(set([val for lst in df.available_fruits for val in lst]))
100 loops, best of 3: 5.1 ms per loop

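  • When I saw the `dataframe` tag I didn't dare to answer, because I know nobody can beat jezrael at pandas. I really admire you, buddy, and I've learned a lot from you. Thanks :) +1 (4 upvotes)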

Zer*_*ero 5

Another approach: call sum on the column to concatenate the lists, then use set to remove duplicates.

In [779]: list(set(df.available_fruits.sum()))
Out[779]: ['tomato', 'kiwi', 'apple', 'banana']

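However, prefer the chain.from_iterable method from @jezrael or the list-flattening method from @perigon; summing lists becomes far slower as the frame grows.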
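
The reason sum falls behind is that it joins the lists with repeated +, and each + copies everything accumulated so far, so the total work grows roughly quadratically with the number of rows. Below is a plain-Python sketch of the difference; the rows list is a made-up stand-in for the column, not something from the original answer.

```python
from itertools import chain

# Hypothetical stand-in for df.available_fruits.tolist().
rows = [["apple", "banana"], [], ["apple", "tomato"], ["banana"], ["kiwi"]]

# Summing lists concatenates them with +, much like sum(rows, []):
# every step builds a new, longer list by copying the previous one.
via_sum = sum(rows, [])

# chain.from_iterable walks each inner list once, with no intermediate copies.
via_chain = list(chain.from_iterable(rows))

print(via_sum == via_chain)
print(sorted(set(via_chain)))
```

The two flattened lists are identical, so the choice between them is purely about speed.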


piR*_*red 5

Option 0

from cytoolz import concat

list(set(concat(df.available_fruits.values.tolist())))

Option 1

list({k: 1 for x in df.available_fruits.values.tolist() for k in x})

['apple', 'banana', 'tomato', 'kiwi']
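
Unlike the set-based versions, the dict comprehension keeps the elements in first-seen order, because dicts preserve insertion order in Python 3.7 and later. A small sketch of the same idea with dict.fromkeys follows; the rows list and the name ordered_unique are illustrative, not part of the original answer.

```python
from itertools import chain

# Hypothetical stand-in for df.available_fruits.values.tolist().
rows = [["apple", "banana"], [], ["apple", "tomato"], ["banana"], ["kiwi"]]

# dict keys are unique and remember insertion order (Python 3.7+),
# so this deduplicates while keeping the first occurrence of each fruit.
ordered_unique = list(dict.fromkeys(chain.from_iterable(rows)))
print(ordered_unique)  # ['apple', 'banana', 'tomato', 'kiwi']
```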

Option 2
Out of left field...

from sklearn.preprocessing import MultiLabelBinarizer

MultiLabelBinarizer().fit(df.available_fruits).classes_.tolist()

['apple', 'banana', 'kiwi', 'tomato']

Timing:

  • Fastest on small data:
    • pir1, jez2
  • Fastest on large data:
    • pir2, with jez2 very close

results.div(results.min(1), 0).round(2).pipe(lambda d: d.assign(Best=d.idxmin(1)))

       pir0  pir1  pir2     galt   jez1  jez2  prgn  Best
N                                                        
1      2.36  1.00  4.43    13.93  10.82  1.00  2.86  pir1
3      1.67  1.51  3.94    12.27   7.20  1.00  2.73  jez2
10     1.59  1.09  4.90     9.90   9.24  1.00  3.03  jez2
30     1.20  1.39  2.44     6.78   9.42  1.00  2.67  jez2
100    1.06  1.45  1.66    12.15  20.50  1.00  2.00  jez2
300    1.13  1.76  1.33    28.30  33.41  1.00  2.01  jez2
1000   1.00  1.70  1.11   111.74  32.79  1.18  1.95  pir0
3000   1.00  1.93  1.02   364.07  32.18  1.03  2.02  pir0
10000  1.08  1.87  1.00  1223.63  35.10  1.03  1.97  pir2


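from itertools import chain
from timeit import timeit

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from cytoolz import concat
from sklearn.preprocessing import MultiLabelBinarizer

pir0 = lambda df: list(set(concat(df.available_fruits.values.tolist())))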
pir1 = lambda df: list({k: 1 for x in df.available_fruits.values.tolist() for k in x})
pir2 = lambda df: MultiLabelBinarizer().fit(df.available_fruits).classes_.tolist()
galt = lambda df: list(set(df.available_fruits.sum()))
jez1 = lambda df: np.unique(np.concatenate(df['available_fruits'].values.tolist())).tolist()
jez2 = lambda df: list(set(chain.from_iterable(df.available_fruits.values.tolist())))
prgn = lambda df: list(set([val for lst in df.available_fruits for val in lst]))

results = pd.DataFrame(
    index=pd.Index([1, 3, 10, 30, 100, 300, 1000, 3000, 10000, 30000], name='N'),
    columns='pir0 pir1 pir2 galt jez1 jez2 prgn'.split(),
    dtype=float
)

for i in results.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        # DataFrame.set_value was removed in pandas 1.0; .at is the replacement
        results.at[i, j] = timeit(stmt, setp, number=10)

fig, (a1, a2) = plt.subplots(1, 2, figsize=(10, 10))
results.plot(loglog=True, ax=a1)
results.div(results.min(1), 0).round(2).plot.barh(logx=True, ax=a2)