Bos*_*sam 3 python dataframe pandas
Suppose I have a column in a pandas.DataFrame like this:
id available_fruits
1 ['apple', 'banana']
1 []
2 ['apple', 'tomato']
1 ['banana']
2 ['kiwi']
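I want to create a list all_available_fruits with no duplicates, which should be ['apple', 'banana', 'kiwi', 'tomato'].
In other words, I want to combine all the elements of the lists in a pandas.DataFrame column. How can I do this?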
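For anyone who wants to reproduce the problem, here is a minimal sketch of the sample frame. It assumes each cell of available_fruits holds a real Python list (not a string that looks like one); the variable name expected is only illustrative.

```python
import pandas as pd

# Sample data from the question; each cell is assumed to be a Python list.
df = pd.DataFrame({
    "id": [1, 1, 2, 1, 2],
    "available_fruits": [
        ["apple", "banana"],
        [],
        ["apple", "tomato"],
        ["banana"],
        ["kiwi"],
    ],
})

# The result the question asks for.
expected = ["apple", "banana", "kiwi", "tomato"]
```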
Use numpy.concatenate for flattening, then numpy.unique:
import numpy as np
a = np.unique(np.concatenate(df['available_fruits'].values.tolist())).tolist()
print(a)
['apple', 'banana', 'kiwi', 'tomato']
Another solution is to flatten with chain.from_iterable, get the unique values with set, and finally convert to a list:
from itertools import chain
a = list(set(chain.from_iterable(df.available_fruits.values.tolist())))
print(a)
['tomato', 'kiwi', 'apple', 'banana']
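Note that a set has no defined order, so the output above can differ between runs. If a stable order matters, wrapping the set in sorted() is one option. A minimal sketch, using plain lists as a stand-in for df.available_fruits.values.tolist():

```python
from itertools import chain

# Plain lists standing in for df.available_fruits.values.tolist().
rows = [["apple", "banana"], [], ["apple", "tomato"], ["banana"], ["kiwi"]]

# set() removes duplicates but has no defined order; sorted() fixes the order.
all_available_fruits = sorted(set(chain.from_iterable(rows)))
print(all_available_fruits)  # ['apple', 'banana', 'kiwi', 'tomato']
```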
Timings:
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)
In [62]: %timeit list(set(concat(df.available_fruits.values.tolist())))
100 loops, best of 3: 3.16 ms per loop
In [63]: %timeit np.unique(np.concatenate(df['available_fruits'].values.tolist())).tolist()
10 loops, best of 3: 99.2 ms per loop
#John Galt's solution
In [64]: %timeit list(set(df.available_fruits.sum()))
1 loop, best of 3: 4.12 s per loop
#pir's solution 0
In [65]: %timeit list(set(concat(df.available_fruits.values.tolist())))
100 loops, best of 3: 3.16 ms per loop
#pir's solution 1
In [66]: %timeit list({k: 1 for x in df.available_fruits.values.tolist() for k in x})
100 loops, best of 3: 4.59 ms per loop
#pir's solution 2
In [67]: %%timeit
...: from sklearn.preprocessing import MultiLabelBinarizer
...:
...: mlb = MultiLabelBinarizer()
...: mlb.fit(df.available_fruits)
...: list(mlb.classes_)
...:
100 loops, best of 3: 4.07 ms per loop
#perigon's solution
In [68]: %timeit list(set([val for lst in df.available_fruits for val in lst]))
100 loops, best of 3: 5.1 ms per loop
Another approach: use sum to concatenate the lists, then set to remove duplicates.
In [779]: list(set(df.available_fruits.sum()))
Out[779]: ['tomato', 'kiwi', 'apple', 'banana']
However, prefer the chain.from_iterable approach from @jezrael or the list-flattening approach from @perigon; they are much faster on larger frames.
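The slowdown is consistent with how sum works on lists: every + builds a new list and copies everything gathered so far, so the work grows roughly quadratically with the number of rows, while chain.from_iterable visits each element once. A small sketch with plain lists, assuming Series.sum on a column of lists concatenates pairwise much like the built-in sum:

```python
from functools import reduce
from itertools import chain
import operator

# Plain lists standing in for the available_fruits column.
rows = [["apple", "banana"], [], ["apple", "tomato"], ["banana"], ["kiwi"]]

# sum(rows, []) behaves like repeated list concatenation: every step
# allocates a new list and copies all elements gathered so far.
via_sum = sum(rows, [])
via_reduce = reduce(operator.add, rows, [])

# chain.from_iterable walks each inner list once without intermediate copies.
via_chain = list(chain.from_iterable(rows))

# All three produce the same flattened list; only the cost differs.
assert via_sum == via_reduce == via_chain
```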
Option 0
from cytoolz import concat
list(set(concat(df.available_fruits.values.tolist())))
Option 1
list({k: 1 for x in df.available_fruits.values.tolist() for k in x})
['apple', 'banana', 'tomato', 'kiwi']
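The dict-based version returns the fruits in first-seen order because dict keys are unique and, on Python 3.7+, keep insertion order. A roughly equivalent sketch using dict.fromkeys, with plain lists standing in for the column (this is a variation, not the code that was timed):

```python
from itertools import chain

# Plain lists standing in for df.available_fruits.values.tolist().
rows = [["apple", "banana"], [], ["apple", "tomato"], ["banana"], ["kiwi"]]

# dict keys are unique and, on Python 3.7+, keep insertion order,
# so this deduplicates while preserving first-seen order.
ordered_unique = list(dict.fromkeys(chain.from_iterable(rows)))
print(ordered_unique)  # ['apple', 'banana', 'tomato', 'kiwi']
```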
Option 2
Out of left field...
from sklearn.preprocessing import MultiLabelBinarizer
MultiLabelBinarizer().fit(df.available_fruits).classes_.tolist()
['apple', 'banana', 'kiwi', 'tomato']
Timing
pir1 and jez2 are the fastest on the smallest inputs; pir2 is very close to jez2 on the largest.
results.div(results.min(1), 0).round(2).pipe(lambda d: d.assign(Best=d.idxmin(1)))
         pir0  pir1  pir2     galt   jez1  jez2  prgn  Best
N
1        2.36  1.00  4.43    13.93  10.82  1.00  2.86  pir1
3        1.67  1.51  3.94    12.27   7.20  1.00  2.73  jez2
10       1.59  1.09  4.90     9.90   9.24  1.00  3.03  jez2
30       1.20  1.39  2.44     6.78   9.42  1.00  2.67  jez2
100      1.06  1.45  1.66    12.15  20.50  1.00  2.00  jez2
300      1.13  1.76  1.33    28.30  33.41  1.00  2.01  jez2
1000     1.00  1.70  1.11   111.74  32.79  1.18  1.95  pir0
3000     1.00  1.93  1.02   364.07  32.18  1.03  2.02  pir0
10000    1.08  1.87  1.00  1223.63  35.10  1.03  1.97  pir2
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import chain
from timeit import timeit
from cytoolz import concat
from sklearn.preprocessing import MultiLabelBinarizer

# One candidate per answer; each takes the DataFrame and returns the unique fruits.
pir0 = lambda df: list(set(concat(df.available_fruits.values.tolist())))
pir1 = lambda df: list({k: 1 for x in df.available_fruits.values.tolist() for k in x})
pir2 = lambda df: MultiLabelBinarizer().fit(df.available_fruits).classes_.tolist()
galt = lambda df: list(set(df.available_fruits.sum()))
jez1 = lambda df: np.unique(np.concatenate(df['available_fruits'].values.tolist())).tolist()
jez2 = lambda df: list(set(chain.from_iterable(df.available_fruits.values.tolist())))
prgn = lambda df: list(set([val for lst in df.available_fruits for val in lst]))

# Rows: how many times the sample df is repeated; columns: the candidates.
results = pd.DataFrame(
    index=pd.Index([1, 3, 10, 30, 100, 300, 1000, 3000, 10000, 30000], name='N'),
    columns='pir0 pir1 pir2 galt jez1 jez2 prgn'.split(),
    dtype=float
)
for i in results.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        # Store the total time of 10 runs in cell (i, j).
        results.at[i, j] = timeit(stmt, setp, number=10)

# Left: raw timings (log-log); right: timings relative to the fastest per row.
fig, (a1, a2) = plt.subplots(1, 2, figsize=(10, 10))
results.plot(loglog=True, ax=a1)
results.div(results.min(1), 0).round(2).plot.barh(logx=True, ax=a2)
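For a quick check without pandas, cytoolz, or scikit-learn, a stripped-down harness along the same lines might look like the sketch below. The candidate names, the repeat factor of 1000, and the plain-list input are all assumptions made for illustration; the numbers it prints are not comparable to the table above.

```python
from itertools import chain
from timeit import timeit

# Plain lists standing in for the column, repeated to get a larger input.
rows = [["apple", "banana"], [], ["apple", "tomato"], ["banana"], ["kiwi"]] * 1000

# Hypothetical names for three of the pure-Python approaches.
candidates = {
    "set_chain": lambda r: list(set(chain.from_iterable(r))),
    "dict_comp": lambda r: list({k: 1 for x in r for k in x}),
    "list_comp": lambda r: list(set([v for lst in r for v in lst])),
}

# timeit accepts a callable, which avoids the string statement and the
# `from __main__ import ...` setup.
timings = {
    name: timeit(lambda f=func: f(rows), number=10)
    for name, func in candidates.items()
}

# Print the candidates from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.6f} s")
```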