Pandas DataFrame assignment bug using a dictionary of strings and floats?

Question


Tha*_*Guy 10 python dictionary pandas .loc

Problem

Pandas appears to support assigning a dictionary to a row entry with df.loc, like this:

import pandas as pd

df = pd.DataFrame(columns = ['a','b','c'])
entry = {'a':'test', 'b':1, 'c':float(2)}
df.loc[0] = entry


As expected, Pandas inserts the dictionary values into the matching columns according to the dictionary keys. Printing the DataFrame gives:

      a  b    c
0  test  1  2.0


However, if you overwrite the same entry, Pandas assigns the dictionary keys instead of the dictionary values. Printing the DataFrame gives:

   a  b  c
0  a  b  c


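Question

Why does this happen?

Specifically, why does this happen only on the second assignment? All subsequent assignments revert to the original result and contain the (almost) expected values: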

      a  b  c
0  test  1  2


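I say almost because the dtype of c is actually object rather than float in all subsequent results.

I have determined that this happens whenever a string and a float are involved. With only a string and an integer, or an integer and a float, you do not see this behavior.

Example code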

import pandas as pd

df = pd.DataFrame(columns = ['a','b','c'])
print(f'empty df:\n{df}\n\n')

entry = {'a':'test', 'b':1, 'c':float(2)}
print(f'dictionary to be entered:\n{entry}\n\n')

df.loc[0] = entry
print(f'df after entry:\n{df}\n\n')

df.loc[0] = entry
print(f'df after second entry:\n{df}\n\n')

df.loc[0] = entry
print(f'df after third entry:\n{df}\n\n')

df.loc[0] = entry
print(f'df after fourth entry:\n{df}\n\n')


This gives the following printout:

empty df:
Empty DataFrame
Columns: [a, b, c]
Index: []


dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.0}


df after entry:
      a  b    c
0  test  1  2.0


df after second entry:
   a  b  c
0  a  b  c


df after third entry:
      a  b  c
0  test  1  2


df after fourth entry:
      a  b  c
0  test  1  2


Answer 1

Hen*_*ker 8

Behavior in pandas 1.2.4:

empty df:
Empty DataFrame
Columns: [a, b, c]
Index: []


dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.3}


df after entry:
      a  b    c
0  test  1  2.3


df after second entry:
   a  b  c
0  a  b  c


df after third entry:
   a  b  c
0  a  b  c


df after fourth entry:
   a  b  c
0  a  b  c


The first time df.loc[0] is assigned, the function _setitem_with_indexer_missing runs, because there is no index label 0 on the axis yet.

This branch runs:

elif isinstance(value, dict):
    value = Series(
        value, index=self.obj.columns, name=indexer, dtype=object
    )


This converts the dict into a Series, so it behaves as expected.

On later assignments, however, the index is no longer missing (label 0 exists), so _setitem_with_indexer_split_path runs instead:
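As a minimal sketch of why that conversion behaves well, the snippet below builds the same kind of Series by hand. The names columns, entry and row are illustrative only, and the dict keys are deliberately out of column order to show that the Series constructor pulls values out by label rather than by position.

```python
import pandas as pd

columns = pd.Index(['a', 'b', 'c'])
# Keys are intentionally not in column order.
entry = {'c': 2.3, 'a': 'test', 'b': 1}

# Passing index= makes the constructor look up each label in the dict.
row = pd.Series(entry, index=columns, name=0, dtype=object)

print(row.tolist())
# ['test', 1, 2.3]
```

Because the lookup is by label, the resulting values line up with the columns no matter how the dict was ordered.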

elif len(ilocs) == len(value):
    # We are setting multiple columns in a single row.
    for loc, v in zip(ilocs, value):
        self._setitem_single_column(loc, v, pi)


This simply zips the column positions with whatever iterating over the dict yields, and iterating over a dict yields its keys.

In this case, that is roughly equivalent to:

entry = {'a': 'test', 'b': 1, 'c': float(2.3)}
print(list(zip([0, 1, 2], entry)))
# [(0, 'a'), (1, 'b'), (2, 'c')]


That is why the values end up being the keys.

So the problem is not as specific as it first appears:
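To make the contrast concrete, here is a plain-Python sketch with no pandas internals involved. The variable names are illustrative, and the dict keys are deliberately out of column order. It compares zipping positions with the dict itself, with its values, and with a lookup by column label.

```python
columns = ['a', 'b', 'c']
entry = {'c': 2.3, 'a': 'test', 'b': 1}
positions = range(len(columns))

# Iterating a dict yields keys, so the keys get paired with positions.
by_iteration = list(zip(positions, entry))
# [(0, 'c'), (1, 'a'), (2, 'b')]

# Using .values() pairs values, but in insertion order, not column order.
by_values = list(zip(positions, entry.values()))
# [(0, 2.3), (1, 'test'), (2, 1)]

# Looking each column label up in the dict is independent of dict order.
by_label = [(i, entry[col]) for i, col in enumerate(columns)]
# [(0, 'test'), (1, 1), (2, 2.3)]

print(by_iteration, by_values, by_label, sep='\n')
```

Only the label lookup gives the placement a reader would expect from a dict keyed by column name.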

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'])
print(f'initial df:\n{df}\n\n')

entry = {'a': 'test', 'b': 1, 'c': float(2.3)}
print(f'dictionary to be entered:\n{entry}\n\n')

df.loc[0] = entry
print(f'df after entry:\n{df}\n\n')


initial df:
   a  b  c
0  1  2  3

dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.3}

df after entry:
   a  b  c
0  a  b  c

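
If the index label already exists, the dict is not converted to a Series: it is simply zipped, as an iterable, with the column positions. For a dict, that means the keys are the values that end up in the frame.

This is also likely why only iterables whose iterators return their values are acceptable right-hand values in a loc assignment.

I also agree with @DeepSpace that this should be raised as a bug.

Behavior in pandas 1.1.5:

The initial assignment is the same as in 1.2.4. However, the dtypes are worth noting here: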
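One possible workaround, sketched here under the assumption that wrapping the dict in a Series is acceptable for your use case, is to do the conversion yourself before assigning. A Series on the right-hand side of a loc assignment is aligned on the column labels. The frame below is created with dtype=object purely so that mixed string, integer and float values can be stored without any dtype change; that choice is part of this sketch, not of the original example.

```python
import pandas as pd

# An existing row at label 0, so the "index is missing" path is not taken.
df = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'], dtype=object)

# Keys are intentionally not in column order.
entry = {'c': 2.3, 'a': 'test', 'b': 1}

# Convert explicitly: the Series index carries the column labels,
# so pandas aligns values to columns instead of zipping by position.
df.loc[0] = pd.Series(entry)

print(df)
```

If a Series is not wanted, building a list in column order, for example [entry[col] for col in df.columns], avoids relying on alignment altogether.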

import pandas as pd

df = pd.DataFrame({0: [1, 2, 3]}, columns=['a', 'b', 'c'])

entry = {'a': 'test', 'b': 1, 'c': float(2.3)}

# First Entry
df.loc[0] = entry
print(df.dtypes)
# a     object
# b     object
# c    float64
# dtype: object

# Second Entry
df.loc[0] = entry
print(df.dtypes)
# a    object
# b    object
# c    object
# dtype: object

# Third Entry
df.loc[0] = entry
print(df.dtypes)
# a    object
# b    object
# c    object
# dtype: object

# Fourth Entry
df.loc[0] = entry
print(df.dtypes)
# a    object
# b    object
# c    object
# dtype: object

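
They are notable because when

take_split_path = self.obj._is_mixed_type

is True, the same zip operation as in 1.2.4 is performed.

In 1.1.5, however, take_split_path is True only right after the first assignment, because c is float64 at that point and the frame has mixed dtypes. After the second assignment the dtypes are all object, so take_split_path is False and subsequent assignments use: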

if isinstance(value, (ABCSeries, dict)):
    # TODO(EA): ExtensionBlock.setitem this causes issues with
    # setting for extensionarrays that store dicts. Need to decide
    # if it's worth supporting that.
    value = self._align_series(indexer, Series(value))


This naturally aligns the dict correctly.
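The dtype condition described above can be approximated from the outside. The sketch below does not call the private _is_mixed_type attribute; has_mixed_dtypes is a hypothetical helper that only counts distinct column dtypes, which is an assumption about what matters here rather than a copy of the pandas implementation. The two frames are built by hand to mimic the dtypes printed after the first and second assignments.

```python
import pandas as pd


def has_mixed_dtypes(frame):
    """Hypothetical stand-in: True if the columns have more than one dtype."""
    return frame.dtypes.nunique() > 1


base = pd.DataFrame({'a': ['test'], 'b': [1], 'c': [2.3]})

# dtypes as reported after the first assignment: object, object, float64
after_first = base.astype({'a': object, 'b': object, 'c': 'float64'})

# dtypes as reported after the second assignment: all object
after_second = base.astype(object)

print(has_mixed_dtypes(after_first))   # True
print(has_mixed_dtypes(after_second))  # False
```

Under that approximation, only the frame left by the first assignment would take the zipping path, which matches the keys appearing on the second assignment alone.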
