Python pandas: how to obtain the datatypes of objects in a mixed-datatype column?

Question

Given a pandas.DataFrame with a column holding mixed datatypes, e.g.

df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})

How can I obtain the datatypes of the individual objects in the column (a Series)? For example, suppose I want to modify all entries of a certain type, such as multiplying all integers by some factor.

I could build a boolean mask by iterating over the values and use it in loc, like

m = np.array([isinstance(v, int) for v in df['mixed']])

df.loc[m, 'mixed'] *= 10

# df
#                  mixed
# 0  2020-10-04 00:00:00
# 1                 9990
# 2             a string

That does the trick but I was wondering if there was a more pandastic way of doing this?

Answer 1

jez*_*ael 4

One idea is to use to_numeric with errors='coerce' and then test for non-missing values to check whether each entry is numeric:

m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print(df)
                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string
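
One caveat worth keeping in mind: to_numeric tests whether a value can be parsed as a number, not whether it is an int. The following is only a minimal sketch, using a made-up Series `s` (not the data from the question), of how the two masks can differ once the column also holds floats or numeric strings:

```python
import pandas as pd

# Hypothetical example data, chosen to expose the difference
s = pd.Series([999, '5', 'a string', 2.5], dtype=object)

# Values that can be parsed as numbers (this includes '5' and 2.5)
numeric_mask = pd.to_numeric(s, errors='coerce').notna()

# Values that are actual Python int objects
int_mask = s.map(lambda v: isinstance(v, int))

print(numeric_mask.tolist())  # [True, True, False, True]
print(int_mask.tolist())      # [True, False, False, False]
```

If the goal is strictly "multiply the integers", the isinstance-based mask is probably the safer of the two, since multiplying the string '5' by 10 would repeat it ten times rather than scale it.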

Unfortunately, this is slow. Here are some other ideas, with timings:

N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})


In [29]: %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.26 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [30]: %timeit np.array([isinstance(v, int) for v in df['mixed']])
1.12 s ± 77.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [31]: %timeit pd.to_numeric(df['mixed'], errors='coerce').notna()
3.07 s ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [34]: %timeit ([isinstance(v, int) for v in df['mixed']])
909 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Note: In [35] and In [36] compare a type object with the string 'int',
# which is always False, so they are meaningful as timings only.
In [35]: %timeit df.mixed.map(lambda x : type(x))=='int'
877 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [36]: %timeit df.mixed.map(lambda x : type(x) =='int')
842 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [37]: %timeit df.mixed.map(lambda x : isinstance(x, int))
807 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

By default, pandas cannot use vectorization effectively here because of the mixed values, so an element-wise approach is necessary.
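
The same element-wise approach can also list the type of every entry, which is what the question asks for. This is a minimal sketch that assumes the DataFrame from the question; the variable names `types`, `type_names` and `m` are chosen here for illustration only:

```python
import pandas as pd

df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})

# Element-wise lookup of each object's type
types = df['mixed'].map(type)
type_names = types.map(lambda t: t.__name__)
print(type_names.tolist())  # ['Timestamp', 'int', 'str']

# Exact type match. Unlike isinstance(v, int), this excludes bool,
# because bool is a subclass of int in Python.
m = df['mixed'].map(lambda v: type(v) is int)
df.loc[m, 'mixed'] *= 10
print(df['mixed'].tolist()[1])  # 9990
```

Whether an exact type check or isinstance is the better choice depends on the data: isinstance also matches subclasses, while `type(v) is int` would miss values such as NumPy integers.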
