MrF*_*pes 5 python types pandas
Given a pandas.DataFrame with a column holding mixed datatypes, e.g.
import pandas as pd
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})
I was wondering how to obtain the datatypes of the individual objects in the column (Series)? Suppose I want to modify all entries in the Series that are of a certain type, like multiply all integers by some factor.
I could iteratively derive a mask and use it in loc, like
import numpy as np
m = np.array([isinstance(v, int) for v in df['mixed']])
df.loc[m, 'mixed'] *= 10
# df
#                  mixed
# 0  2020-10-04 00:00:00
# 1                 9990
# 2             a string
That does the trick but I was wondering if there was a more pandastic way of doing this?
One idea is to test for numeric values by calling to_numeric with errors='coerce' and keeping the non-missing results:
m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print(df)
                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string
Unfortunately, this is slow. Here are some other ideas, with timings:
N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})

In [29]: %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.26 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [30]: %timeit np.array([isinstance(v, int) for v in df['mixed']])
1.12 s ± 77.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [31]: %timeit pd.to_numeric(df['mixed'], errors='coerce').notna()
3.07 s ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [34]: %timeit ([isinstance(v, int) for v in df['mixed']])
909 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [35]: %timeit df.mixed.map(lambda x : type(x))=='int'
877 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [36]: %timeit df.mixed.map(lambda x : type(x) =='int')
842 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [37]: %timeit df.mixed.map(lambda x : isinstance(x, int))
807 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that In [35] and In [36] compare a type object with the string 'int', which is always False, so they are useful only as timings; a correct exact-type test would be type(x) is int.
By default, pandas cannot vectorize efficiently here because the column holds mixed values, so an element-wise approach is necessary.
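Putting the element-wise approach together, here is a minimal sketch of listing the per-element types and building the mask with map. The extra True entry and the names type_names and mask are assumptions added for illustration, not part of the original question. The mask uses type(x) is int rather than isinstance(x, int) because bool is a subclass of int in Python, so isinstance would also select True and False.

```python
import pandas as pd

# Same column as in the question, plus a bool (added for illustration)
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string', True]})

# Name of each element's Python type, e.g. to inspect what the column holds
type_names = df['mixed'].map(lambda x: type(x).__name__)

# Exact type match: selects plain ints only, skipping bool
mask = df['mixed'].map(lambda x: type(x) is int)

# Modify only the selected entries
df.loc[mask, 'mixed'] *= 10
```

If subclasses of int should be included as well, isinstance(x, int) can be swapped in, optionally combined with an explicit bool exclusion.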
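One caveat about the to_numeric test above: it answers "can this be parsed as a number?" rather than "is this an int?", so the two masks can differ. The following sketch uses a made-up Series (the values '123' and 1.5 are assumptions for illustration) to show the difference.

```python
import pandas as pd

# Hypothetical data: a numeric string, an int, a float, and a plain string
s = pd.Series(['123', 999, 1.5, 'a string'], dtype=object)

# Anything parseable as a number counts, including '123' and 1.5
numeric_mask = pd.to_numeric(s, errors='coerce').notna()

# Only actual Python ints count
int_mask = s.map(lambda x: isinstance(x, int))

print(numeric_mask.tolist())
print(int_mask.tolist())
```

With the numeric mask, a later *= 10 would also hit the string '123' and repeat it ten times instead of multiplying, so the isinstance-based mask is the safer choice when only integers should change.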