Column dtype with pandas read_json

Question

I have a JSON file that looks like this:

[{"A": 0, "B": "x"}, {"A": 1, "B": "y", "C": 0}, {"A": 2, "B": "z", "C": 1}]

Because column "C" contains a NaN value (in the first row), pandas automatically infers its dtype as "float64":

>>> pd.read_json(path).C.dtype
dtype('float64')

However, I want column "C" to have the dtype "Int32". pd.read_json(path, dtype={"C": "Int32"}) does not work:

>>> pd.read_json(path, dtype={"C": "Int32"}).C.dtype
dtype('float64')

In contrast, pd.read_json(path).astype({"C": "Int32"}) does work:

>>> pd.read_json(path).astype({"C": "Int32"}).C.dtype
Int32Dtype()

Why does this happen, and how can I set the correct dtype using only the pd.read_json function?

Answer 1

Ste*_*tef 5

The reason lies in this part of the pandas source code:

        dtype = (
            self.dtype.get(name) if isinstance(self.dtype, dict) else self.dtype
        )
        if dtype is not None:
            try:
                dtype = np.dtype(dtype)
                return data.astype(dtype), True
            except (TypeError, ValueError):
                return data, False

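When this code tries to convert the whole column (array) to the requested type, 'Int32' is converted to numpy.int32, which then raises a ValueError (Cannot convert non-finite values (NA or inf) to integer). The original, unconverted data is therefore returned from the except block.
I guess this is some kind of bug in pandas; at the very least, the behavior is not properly documented.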
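
The silent fallback can be reproduced outside of pandas' reader. The following is a minimal sketch, assuming a hypothetical helper try_convert that only copies the try/except pattern from the excerpt above; it is not the actual pandas implementation. Depending on the installed NumPy version, either the np.dtype call or the subsequent astype call may be the step that raises, but both exception types are caught the same way.

```python
import numpy as np
import pandas as pd


def try_convert(data, dtype):
    """Hypothetical helper mirroring the try/except pattern in the excerpt."""
    try:
        dtype = np.dtype(dtype)
        return data.astype(dtype), True
    except (TypeError, ValueError):
        # Any failure is swallowed and the unconverted data is returned.
        return data, False


with_nan = pd.Series([np.nan, 0.0, 1.0])
without_nan = pd.Series([0.0, 1.0])

# A column with a missing value: the conversion fails and is silently skipped.
result, converted = try_convert(with_nan, "Int32")
print(converted, result.dtype)

# A column without missing values converts to a plain NumPy integer dtype.
result_ok, converted_ok = try_convert(without_nan, "int32")
print(converted_ok, result_ok.dtype)
```

The point of the sketch is that the caller never sees the error, which matches the observation that read_json quietly returns a float64 column.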

astype, on the other hand, works differently: it applies 'astype' element-wise to the Series, and can therefore create a mixed-type column.
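
As a small illustration of why Series.astype accepts the 'Int32' alias, the sketch below resolves the string with pd.api.types.pandas_dtype and then converts a float column that contains NaN. The sample Series is made up for the example, and the sketch only shows the observable result, not the internal code path that astype takes.

```python
import numpy as np
import pandas as pd

# pandas resolves the capitalized alias to its nullable extension dtype.
resolved = pd.api.types.pandas_dtype("Int32")
print(type(resolved).__name__)

# Converting a float column with NaN succeeds; NaN becomes a missing value.
column = pd.Series([np.nan, 0.0, 1.0])
nullable = column.astype("Int32")
print(nullable.dtype)
print(nullable.isna().tolist())
```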

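Interestingly, when you specify the extension type pd.Int32Dtype() directly (instead of its string alias 'Int32'), you get the desired result at first glance, but if you then look at the element types, they are still floats: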

df = pd.read_json(json, dtype={"C": pd.Int32Dtype})
print(df)
#   A  B    C
#0  0  x  NaN
#1  1  y    0
#2  2  z    1
print(df.C.map(type))
#0    <class 'float'>
#1    <class 'float'>
#2    <class 'float'>
#Name: C, dtype: object

For comparison:

print(df.C.astype('Int32').map(type))
#0    <class 'pandas._libs.missing.NAType'>
#1                            <class 'int'>
#2                            <class 'int'>
#Name: C, dtype: object
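
Putting the pieces together, a possible workaround is to let read_json infer the types and cast afterwards, as in the question. The sketch below is self-contained under a few assumptions: the JSON is supplied through io.StringIO rather than a file path, and read_json_nullable is a hypothetical wrapper name, not a pandas function.

```python
from io import StringIO

import pandas as pd

RAW = '[{"A": 0, "B": "x"}, {"A": 1, "B": "y", "C": 0}, {"A": 2, "B": "z", "C": 1}]'


def read_json_nullable(source, nullable_dtypes):
    """Hypothetical wrapper: read the JSON first, then cast selected columns."""
    frame = pd.read_json(source)
    return frame.astype(nullable_dtypes)


df = read_json_nullable(StringIO(RAW), {"C": "Int32"})
print(df.dtypes)
print(df)
```

This does not make read_json itself honor the alias; it only keeps the two steps in one place.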
