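To verify that I understand what the pandas methods df.groupby() and df.reset_index() do, I attempted a round trip from a data frame to a grouped version of the same data and back. After the round trip, the columns and rows had to be sorted again, because groupby() affects row order and reset_index() affects column order. After two quick operations to put the columns and index back in order, the data frames look identical: the same column names, the same dtype for every column, equal index values, and equal data values.

Yet after all these checks succeed, df1.equals(df5) returns the astonishing value False.

What difference between these data frames is equals() finding that I have not yet figured out how to check for myself?

Test code: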
csv_text = """\
Title,Year,Director
North by Northwest,1959,Alfred Hitchcock
Notorious,1946,Alfred Hitchcock
The Philadelphia Story,1940,George Cukor
To Catch a Thief,1955,Alfred Hitchcock
His Girl Friday,1940,Howard Hawks
"""
import io
import pandas as pd
# Read the inline CSV text above rather than a separate sample.csv file.
df1 = pd.read_csv(io.StringIO(csv_text))
df1.columns = df1.columns.str.lower()
print(df1)
df2 = df1.groupby(['director', df1.index]).first()
df3 = df2.reset_index('director')
df4 = df3[['title', 'year', 'director']]
df5 = df4.sort_index()
print(df5)
print()
print(repr(df1.columns))
print(repr(df5.columns))
print()
print(df1.dtypes)
print(df5.dtypes)
print()
print(df1 == df5)
print()
print(df1.index == df5.index)
print()
print(df1.equals(df5))
The output I receive when I run the script is:
                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks
                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks

Index(['title', 'year', 'director'], dtype='object')
Index(['title', 'year', 'director'], dtype='object')

title       object
year         int64
director    object
dtype: object
title       object
year         int64
director    object
dtype: object

   title  year  director
0   True  True      True
1   True  True      True
2   True  True      True
3   True  True      True
4   True  True      True

[ True  True  True  True  True]

False
Thanks for your help!
This feels like a bug to me, but it could simply be that I'm misunderstanding something. The blocks are listed in a different order:
>>> df1._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
>>> df5._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
In core/internals.py, we have the BlockManager method
def equals(self, other):
    self_axes, other_axes = self.axes, other.axes
    if len(self_axes) != len(other_axes):
        return False
    if not all (ax1.equals(ax2) for ax1, ax2 in zip(self_axes, other_axes)):
        return False
    self._consolidate_inplace()
    other._consolidate_inplace()
    return all(block.equals(oblock) for block, oblock in
               zip(self.blocks, other.blocks))
and that final all assumes that the blocks in self and other correspond. But if we add some print calls before it, we see:
>>> df1.equals(df5)
blocks self: (IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64, ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object)
blocks other: (ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object, IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64)
False
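So we're comparing the wrong things. The reason I'm not sure whether this is a bug is that I don't know whether equals is meant to be this finicky. If it is, I think there's at least a doc bug, because equals should then state loudly that it isn't meant to be used for what you might expect from its name and docstring.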
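To make the pairing problem concrete, here is a minimal pure-Python sketch. It does not use pandas internals: Block, equals_positional, and equals_canonical are hypothetical names invented for illustration, and the values are a shortened stand-in for the real columns. It only shows that pairing blocks by stored position fails when the same blocks appear in a different order, while sorting both sides by a canonical key first does not.

```python
from collections import namedtuple

# Hypothetical stand-in for a block: a dtype name, the column positions
# it covers, and its values. This is NOT pandas' real Block class.
Block = namedtuple("Block", ["dtype", "locs", "values"])


def equals_positional(blocks_a, blocks_b):
    """Pair blocks in stored order, as the zip() in equals() does."""
    return len(blocks_a) == len(blocks_b) and all(
        a == b for a, b in zip(blocks_a, blocks_b))


def equals_canonical(blocks_a, blocks_b):
    """Sort both sides by (dtype, column positions) before pairing."""
    def key(block):
        return (block.dtype, block.locs)
    return equals_positional(sorted(blocks_a, key=key),
                             sorted(blocks_b, key=key))


# One int block for 'year' (position 1) and one object block for
# 'title' and 'director' (positions 0 and 2), with shortened data.
int_block = Block("int64", (1,), ((1959, 1946),))
obj_block = Block("object", (0, 2),
                  (("North by Northwest", "Notorious"),
                   ("Alfred Hitchcock", "Alfred Hitchcock")))

df1_blocks = (int_block, obj_block)  # order seen in df1._data
df5_blocks = (obj_block, int_block)  # order seen in df5._data

print(equals_positional(df1_blocks, df5_blocks))  # False
print(equals_canonical(df1_blocks, df5_blocks))   # True
```

Whether equals should behave like the second function is exactly the open question above; the sketch only illustrates the difference between the two pairing strategies.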