use*_*915 3 python dataframe pandas
我有一个数据框,在一列中我有一个像字符串一样存储的哈希值列表:
'[d85235f50b3c019ad7c6291e3ca58093,03e0fb034f2cb3264234b9eae09b4287]' just to be clear.
Run Code Online (Sandbox Code Playgroud)
数据框看起来像
1
0 [8a88e629c368001c18619c7cd66d3e96, 4b0709dd990a0904bbe6afec636c4213, c00a98ceb6fc7006d572486787e551cc, 0e72ae6851c40799ec14a41496d64406, 76475992f4207ee2b209a4867b42c372]
1 [3277ded8d1f105c84ad5e093f6e7795d]
2 [d85235f50b3c019ad7c6291e3ca58093, 03e0fb034f2cb3264234b9eae09b4287]
Run Code Online (Sandbox Code Playgroud)
我想在此列中创建一个唯一的哈希id列表.
有效的方法是什么?谢谢
选项1
请参阅下面的时间以获得最快的选
您可以在一个理解中嵌入解析和展平
[y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')]
['8a88e629c368001c18619c7cd66d3e96',
'4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc',
'0e72ae6851c40799ec14a41496d64406',
'76475992f4207ee2b209a4867b42c372',
'3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093',
'03e0fb034f2cb3264234b9eae09b4287']
Run Code Online (Sandbox Code Playgroud)
从那里,你可以使用list(set())
,pd.unique
或np.unique
pd.unique([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')])
array(['8a88e629c368001c18619c7cd66d3e96',
'4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc',
'0e72ae6851c40799ec14a41496d64406',
'76475992f4207ee2b209a4867b42c372',
'3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093',
'03e0fb034f2cb3264234b9eae09b4287'], dtype=object)
Run Code Online (Sandbox Code Playgroud)
选项2
为简洁起见,请使用pd.Series.extractall
list(set(df['1'].str.extractall('(\w+)')[0]))
['8a88e629c368001c18619c7cd66d3e96',
'4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc',
'0e72ae6851c40799ec14a41496d64406',
'76475992f4207ee2b209a4867b42c372',
'3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093',
'03e0fb034f2cb3264234b9eae09b4287']
Run Code Online (Sandbox Code Playgroud)
@ jezrael的list(set())
理解是最快的
解析时间为了比较解析和展平,
我保持相同list(set())
.
%timeit list(set(np.concatenate(df['1'].apply(yaml.load).values).tolist()))
%timeit list(set([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')]))
%timeit list(set(chain.from_iterable(df['1'].str.strip('[]').str.split(', '))))
%timeit list(set(df['1'].str.extractall('(\w+)')[0]))
1.01 ms ± 45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
6.42 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
279 µs ± 8.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
941 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Run Code Online (Sandbox Code Playgroud)
这需要我的理解,并使用各种方式使比较这些速度独特
%timeit pd.unique([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')])
%timeit np.unique([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')])
%timeit list(set([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')]))
57.8 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.5 µs ± 552 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
6.18 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Run Code Online (Sandbox Code Playgroud)
归档时间: |
|
查看次数: |
74 次 |
最近记录: |