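I want to remove the first 3 characters from strings in a DataFrame column when the string is longer than 4 characters.
Otherwise, they should remain unchanged.
For example: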
bloomberg_ticker_y
AIM9
DJEM9 # (should be M9)
FAM9
IXPM9 # (should be M9)
I can filter the strings by length:
merged['bloomberg_ticker_y'].str.len() > 4
and slice the strings:
merged['bloomberg_ticker_y'].str[-2:]
but I am not sure how to put the two together and apply them to my DataFrame.
Any help would be appreciated.
You can use a list comprehension:
import pandas as pd
df = pd.DataFrame({'bloomberg_ticker_y': ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']})
df['new'] = [x[-2:] if len(x) > 4 else x for x in df['bloomberg_ticker_y']]
Output:
  bloomberg_ticker_y   new
0               AIM9  AIM9
1              DJEM9    M9
2               FAM9  FAM9
3              IXPM9    M9
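One caveat with the slice above: x[-2:] keeps the last two characters, which is the same as dropping the first 3 only for 5-character strings. A minimal sketch of the difference, assuming a hypothetical 6-character ticker 'ABCDM9' that is not in the original data:

```python
# 'ABCDM9' is a made-up ticker used only to show where the two slices diverge.
tickers = ['AIM9', 'DJEM9', 'ABCDM9']

# Drop the first 3 characters when the string is longer than 4.
drop_first_three = [x[3:] if len(x) > 4 else x for x in tickers]

# Keep only the last 2 characters when the string is longer than 4.
keep_last_two = [x[-2:] if len(x) > 4 else x for x in tickers]

print(drop_first_three)  # ['AIM9', 'M9', 'DM9']
print(keep_last_two)     # ['AIM9', 'M9', 'M9']
```

For the sample data in the question both slices give the same result, so which one to use depends on what longer tickers should look like.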
You can use numpy.where, which picks either the slice or the original value depending on the string length:
import numpy as np
np.where(df['bloomberg_ticker_y'].str.len() > 4,
         df['bloomberg_ticker_y'].str[3:],
         df['bloomberg_ticker_y'])
# array(['AIM9', 'M9', 'FAM9', 'M9'], dtype=object)
df['bloomberg_ticker_sliced'] = (
    np.where(df['bloomberg_ticker_y'].str.len() > 4,
             df['bloomberg_ticker_y'].str[3:],
             df['bloomberg_ticker_y']))
df
  bloomberg_ticker_y bloomberg_ticker_sliced
0               AIM9                    AIM9
1              DJEM9                      M9
2               FAM9                    FAM9
3              IXPM9                      M9
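If you prefer a map-based solution (note that map calls the lambda once per element, so it is a Python-level loop rather than true vectorization), that would be: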
df['bloomberg_ticker_y'].map(lambda x: x[3:] if len(x) > 4 else x)
0    AIM9
1      M9
2    FAM9
3      M9
Name: bloomberg_ticker_y, dtype: object
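If the column may contain missing values, the lambda above would raise a TypeError, since len() is not defined for them. A hedged sketch, assuming a hypothetical column with one missing entry, that uses the na_action='ignore' option of Series.map to skip such entries:

```python
import pandas as pd

# The None entry is an assumption for illustration; the original data has no gaps.
s = pd.Series(['AIM9', 'DJEM9', None, 'IXPM9'])

# na_action='ignore' propagates missing values without calling the lambda on them.
result = s.map(lambda x: x[3:] if len(x) > 4 else x, na_action='ignore')
print(result)
```

The missing entry stays missing in the result, and the other rows are sliced as before.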
Having seen the variety of answers, I decided to compare them on speed:
# Create a large test DataFrame
df = pd.DataFrame({'bloomberg_ticker_y': ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']})
df = pd.concat([df] * 100000)
df.shape
# Out:
(400000, 1)
CS95 #1 np.where
%%timeit
np.where(df['bloomberg_ticker_y'].str.len() > 4,
         df['bloomberg_ticker_y'].str[3:],
         df['bloomberg_ticker_y'])
Result:
163 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
CS95 #2 vectorized map based solution
%%timeit
df['bloomberg_ticker_y'].map(lambda x: x[3:] if len(x) > 4 else x)
Result:
86 ms ± 7.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Yatu Series.mask
%%timeit
df.bloomberg_ticker_y.mask(df.bloomberg_ticker_y.str.len().gt(4),
                           other=df.bloomberg_ticker_y.str[-2:])
Result:
187 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Vlemaistre list comprehension
%%timeit
[x[-2:] if len(x)>4 else x for x in df['bloomberg_ticker_y']]
Result:
84.8 ms ± 4.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
pault str.replace with regex
%%timeit
df["bloomberg_ticker_y"].str.replace(r".{3,}(?=.{2}$)", "", regex=True)
Result:
324 ms ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
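For readers unfamiliar with the pattern: .{3,} matches three or more characters, and the lookahead (?=.{2}$) requires exactly two characters to remain before the end of the string. A 4-character string therefore never matches, while anything longer is cut down to its last two characters. A small sketch with the standard re module, using the sample tickers from the question:

```python
import re

# Same pattern as in the str.replace call: 3+ characters that are
# followed by exactly two characters and then the end of the string.
pattern = re.compile(r".{3,}(?=.{2}$)")

tickers = ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']
replaced = [pattern.sub('', t) for t in tickers]

for before, after in zip(tickers, replaced):
    print(before, '->', after)
# AIM9 -> AIM9
# DJEM9 -> M9
# FAM9 -> FAM9
# IXPM9 -> M9
```

Like x[-2:], this keeps the last two characters rather than dropping exactly the first three, so the two readings only agree for 5-character strings.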
Cobra DataFrame.apply
%%timeit
df.apply(lambda x: (x['bloomberg_ticker_y'][3:] if len(x['bloomberg_ticker_y']) > 4 else x['bloomberg_ticker_y']) , axis=1)
Result:
6.83 s ± 387 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conclusion
The fastest method is the list comprehension (84.8 ms), closely followed by the map-based solution (86 ms).
The slowest method by far is DataFrame.apply (6.83 s, as expected), followed by str.replace with a regex (324 ms).
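The %%timeit cells above only run in IPython or Jupyter. To repeat a comparison like this in a plain script, something along these lines could work. It is a rough sketch using the standard timeit module, with a smaller frame and only three of the methods; the names candidates and timings are my own, and the numbers will differ by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# Smaller test frame than in the benchmark above, so the script runs quickly.
df = pd.DataFrame({'bloomberg_ticker_y': ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']})
df = pd.concat([df] * 1000, ignore_index=True)
col = df['bloomberg_ticker_y']

# Hypothetical registry of the approaches to compare.
candidates = {
    'np.where': lambda: np.where(col.str.len() > 4, col.str[3:], col),
    'map': lambda: col.map(lambda x: x[3:] if len(x) > 4 else x),
    'list comprehension': lambda: [x[3:] if len(x) > 4 else x for x in col],
}

expected = ['AIM9', 'M9', 'FAM9', 'M9'] * 1000
timings = {}
for name, func in candidates.items():
    # Check that every approach gives the same answer before timing it.
    assert list(func()) == expected
    timings[name] = timeit.timeit(func, number=5) / 5
    print(f'{name}: {timings[name] * 1000:.2f} ms per call')
```

Checking the outputs against each other first guards against timing an approach that silently returns something different.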