Outputting the differences between two Pandas DataFrames side by side, highlighting the differences

sky*_*sky 138 html python panel dataframe pandas

I am trying to highlight exactly what changed between two dataframes.

Suppose I have two Python Pandas dataframes:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation

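My goal is to output an HTML table that:

  1. Identifies the rows that have changed (values can be int, float, boolean, or string)
  2. Outputs those rows with the same, OLD, and NEW values (ideally into an HTML table) so the consumer can clearly see what changed between the two dataframes: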

    "StudentRoster Difference Jan-1 - Jan-2":  
    id   Name   score                    isEnrolled           Comment
    112  Nick   was 1.11| now 1.21       False                Graduated
    113  Zoe    4.12                     was True | now False was "" | now   "On   vacation"
    

I suppose I could do a row-by-row and column-by-column comparison, but is there an easier way?

And*_*den 135

The first part is similar to Constantine's answer: you can get a boolean Series marking which rows differ*:

In [21]: ne = (df1 != df2).any(axis=1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first entry is the index and the second is the column that changed.

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation

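*Note: it is important that df1 and df2 share the same index here. To overcome this ambiguity, you can make sure you only look at the shared labels using df1.index & df2.index, but I think I'll leave that as an exercise.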
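The exercise mentioned in the note could be approached roughly as follows. This is a minimal sketch, not the answer author's code: the two small frames are made-up stand-ins, and it uses `Index.intersection` because, as far as I know, recent pandas versions no longer treat `&` between two Index objects as a set intersection.

```python
import pandas as pd

# Hypothetical stand-in data: the frames overlap only on ids 112 and 113.
df1 = pd.DataFrame({"score": [2.17, 1.11, 4.12]}, index=[111, 112, 113])
df2 = pd.DataFrame({"score": [1.21, 4.12, 3.50]}, index=[112, 113, 114])

# Keep only the labels present in both frames, in the same order.
shared = df1.index.intersection(df2.index)
left = df1.loc[shared]
right = df2.loc[shared]

# Both frames now carry identical labels, so != compares like with like.
changed_rows = (left != right).any(axis=1)
print(changed_rows)
```

Rows that exist in only one frame (111 and 114 here) are dropped by this approach, so they would need to be reported separately if they matter.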

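  • If score is "nan" in both df1 and df2, this function will report it as having changed from "nan" to "nan". This is because `np.nan != np.nan` returns `True`. (10 upvotes)
  • I believe "share the same index" means "make sure the index is sorted"... this compares the first row of `df1` with the first row of `df2`, regardless of the index values. JFYI, in case I'm not the only one for whom this wasn't obvious. Thanks! (2 upvotes)
  • @kungfujam is right. Also, if the values being compared are None, you will get false differences there too (2 upvotes)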

Ted*_*rou 71

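Highlighting the differences between two DataFrames

You can use the DataFrame style property to highlight the background color of the cells where there is a difference.

Using the example data from the original question

The first step is to concatenate the DataFrames horizontally with the concat function and distinguish each frame with the keys parameter: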

df_all = pd.concat([df.set_index('id'), df2.set_index('id')], 
                   axis='columns', keys=['First', 'Second'])
df_all

It may be easier to swap the column levels and put the same column names next to each other:

df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
df_final

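Now it is much easier to spot the differences between the frames. But we can go further and use the style property to highlight the cells that differ. We define a custom function to do this, as described in the styling section of the pandas documentation.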

def highlight_diff(data, color='yellow'):
    attr = 'background-color: {}'.format(color)
    other = data.xs('First', axis='columns', level=-1)
    return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
                        index=data.index, columns=data.columns)

df_final.style.apply(highlight_diff, axis=None)

This will highlight cells where both values are missing. You can either fill them or add extra logic so that they don't get highlighted.
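One possible form of that extra logic is sketched below. The name `highlight_diff_skip_nan` and the small sample frames are assumptions for illustration, not part of the answer; the function marks only the 'Second' cell of a pair, and skips pairs where both values are missing.

```python
import numpy as np
import pandas as pd

def highlight_diff_skip_nan(data, color="yellow"):
    """Return a frame of CSS strings marking 'Second' cells that differ from 'First'."""
    attr = f"background-color: {color}"
    first = data.xs("First", axis="columns", level=-1)
    second = data.xs("Second", axis="columns", level=-1)
    # A pair counts as different unless both sides are missing.
    differs = first.ne(second) & ~(first.isna() & second.isna())
    styles = pd.DataFrame("", index=data.index, columns=data.columns)
    for col in differs.columns:
        styles.loc[differs[col], (col, "Second")] = attr
    return styles

# Hypothetical sample data shaped like df_final in the answer.
a = pd.DataFrame({"id": [111, 112, 113],
                  "score": [2.17, 1.11, np.nan],
                  "Comment": ["late", "Graduated", None]})
b = pd.DataFrame({"id": [111, 112, 113],
                  "score": [2.17, 1.21, np.nan],
                  "Comment": ["late", "Graduated", "On vacation"]})
df_all = pd.concat([a.set_index("id"), b.set_index("id")],
                   axis="columns", keys=["First", "Second"])
df_final = df_all.swaplevel(axis="columns")[a.columns[1:]]

styles = highlight_diff_skip_nan(df_final)
print(styles)
```

The function should be usable with the Styler in the same way as the original, for example `df_final.style.apply(highlight_diff_skip_nan, axis=None)`.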

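  • Do you know if it is possible to color "First" and "Second" in different colors? (2 upvotes)
  • This implementation takes much longer when comparing dataframes with 26K rows and 400 columns. Is there any way to speed it up? (2 upvotes)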

Jam*_*ers 38

This answer simply extends @Andy Hayden's, making it resilient to numeric fields that are nan, and wrapping it up into a function.

import pandas as pd
import numpy as np


def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        # Report the mismatch, then coerce df2 to df1's dtypes
        print("Data Types are different, trying to convert")
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        return None
    else:
        # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['id', 'col']
        difference_locations = np.where(diff_mask)
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        return pd.DataFrame({'from': changed_from, 'to': changed_to},
                            index=changed.index)

So, with your data (slightly edited to have a NaN in the score column):

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)
# Raw strings avoid the invalid-escape warning for the regex separator
df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
diff_pd(df1, df2)

Output:

                from           to
id  col                          
112 score       1.11         1.21
113 isEnrolled  True        False
    Comment           On vacation

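  • This function crashes with "ValueError: Can only compare identically-labeled DataFrame objects" when the index labels differ. It would be more robust if it handled that case as well. (3 upvotes)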

小智 18

I ran into this problem, but found an answer before finding this post:

Based on unutbu's answer, load your data...

import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                       Date
111  Jack                            True              2013-05-01 12:00:00
112  Nick   1.11                     False             2013-05-12 15:05:23
     Zoe    4.12                     True                                  ''',

         '''\
id   Name   score                    isEnrolled                       Date
111  Jack   2.17                     True              2013-05-01 12:00:00
112  Nick   1.21                     False                                
     Zoe    4.12                     False             2013-05-01 12:00:00''']


# texts holds str, so wrap it in StringIO (BytesIO fails on Python 3)
df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,17,20], parse_dates=[4])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,17,20], parse_dates=[4])

...define your diff function...

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

Then you can simply use a Panel to conclude:

my_panel = pd.Panel(dict(df1=df1,df2=df2))
print(my_panel.apply(report_diff, axis=0))

#          id  Name        score    isEnrolled                       Date
#0        111  Jack   nan | 2.17          True        2013-05-01 12:00:00
#1        112  Nick  1.11 | 1.21         False  2013-05-12 15:05:23 | NaT
#2  nan | nan   Zoe         4.12  True | False  NaT | 2013-05-01 12:00:00

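By the way, if you are in an IPython Notebook, you may like to use a colored diff function that colors each cell depending on whether the values are different, equal, or null on the left or right: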

from IPython.display import HTML
pd.options.display.max_colwidth = 500  # You need this, otherwise pandas
#                          will limit your HTML strings to 50 characters

def report_diff(x):
    # x[0] comes from df1 and x[1] from df2
    if x[0] == x[1]:
        return str(x[0])
    elif pd.isnull(x[0]) and pd.isnull(x[1]):
        # both missing: green
        return u'<table style="background-color:#00ff00;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', 'nan')
    elif pd.isnull(x[0]) and not pd.isnull(x[1]):
        # missing on the left only: yellow
        return u'<table style="background-color:#ffff00;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', x[1])
    elif not pd.isnull(x[0]) and pd.isnull(x[1]):
        # missing on the right only: blue
        return u'<table style="background-color:#0000ff;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0],'nan')
    else:
        # both present but different: red
        return u'<table style="background-color:#ff0000;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0], x[1])

HTML(my_panel.apply(report_diff, axis=0).to_html(escape=False))

  • Panel is deprecated! Any idea how to port this? (5 upvotes)

unu*_*tbu 17

import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                           Graduated
113  Zoe    4.12                     True       ''',

         '''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                           Graduated
113  Zoe    4.12                     False                         On vacation''']


# texts holds str, so wrap it in StringIO (BytesIO fails on Python 3)
df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,21,20])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,21,20])
df = pd.concat([df1,df2]) 

print(df)
#     id  Name  score isEnrolled               Comment
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.11      False             Graduated
# 2  113   Zoe   4.12       True                   NaN
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.21      False             Graduated
# 2  113   Zoe   4.12      False           On vacation

df.set_index(['id', 'Name'], inplace=True)
print(df)
#           score isEnrolled               Comment
# id  Name                                        
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.11      False             Graduated
# 113 Zoe    4.12       True                   NaN
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.21      False             Graduated
# 113 Zoe    4.12      False           On vacation

def report_diff(x):
    # x holds the two values of one column within a group; index by position
    return x.iloc[0] if x.iloc[0] == x.iloc[1] else '{} | {}'.format(*x)

changes = df.groupby(level=['id', 'Name']).agg(report_diff)
print(changes)

prints

                score    isEnrolled               Comment
id  Name                                                 
111 Jack         2.17          True  He was late to class
112 Nick  1.11 | 1.21         False             Graduated
113 Zoe          4.12  True | False     nan | On vacation

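  • Very nice solution, much more compact than mine! (3 upvotes)
  • @AndyHayden: I'm not entirely happy with this solution; it seems to work only when the index is a multilevel index. If I try to use only `id` as the index, then `df.groupby(level='id')` raises an error, and I'm not sure why... (2 upvotes)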

cs9*_*s95 12

pandas >= 1.1: DataFrame.compare

With pandas 1.1, you can essentially replicate Ted Petrou's output with a single function call. Example taken from the docs:

pd.__version__
# '1.1.0'

df1.compare(df2)

  score       isEnrolled       Comment             
   self other       self other    self        other
1  1.11  1.21        NaN   NaN     NaN          NaN
2   NaN   NaN        1.0   0.0     NaN  On vacation

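Here, "self" refers to the LHS dataframe, while "other" is the RHS dataframe. By default, equal values are replaced with NaN so you can focus on just the differences. If you also want to show the values that are equal, use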

df1.compare(df2, keep_equal=True, keep_shape=True) 

  score       isEnrolled           Comment             
   self other       self  other       self        other
1  1.11  1.21      False  False  Graduated    Graduated
2  4.12  4.12       True  False        NaN  On vacation

You can also change the axis of comparison using align_axis:

df1.compare(df2, align_axis='index')

         score  isEnrolled      Comment
1 self    1.11         NaN          NaN
  other   1.21         NaN          NaN
2 self     NaN         1.0          NaN
  other    NaN         0.0  On vacation

This lays out the self and other values on alternating rows instead of in side-by-side columns.
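For a version that runs on its own, the sketch below rebuilds the question's two rosters by hand and calls `DataFrame.compare` on them. The literal frames are an assumption about how the data was loaded, and the call requires pandas 1.1 or later.

```python
import pandas as pd

# Hand-built copies of the two rosters from the question (assumed layout).
df1 = pd.DataFrame({
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, 4.12],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", None],
})
df2 = pd.DataFrame({
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, 4.12],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
})

# Only rows and columns that contain at least one difference are kept.
diff = df1.compare(df2)
print(diff)
```

Because "Name" is identical in both frames, it does not appear in the result at all; passing `keep_shape=True` would retain it.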


cge*_*cge 8

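If your two dataframes have the same ids in them, then finding out what changed is actually pretty easy. Just doing frame1 != frame2 will give you a boolean DataFrame where each True marks data that has changed. From that, you can easily get the index of each changed row by doing changedids = frame1.index[np.any(frame1 != frame2,axis=1)].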
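A small runnable illustration of that idea follows. The two frames are made-up stand-ins indexed by id, not data from the answer.

```python
import pandas as pd

# Hypothetical frames that share the same ids and columns.
frame1 = pd.DataFrame({"score": [2.17, 1.11, 4.12],
                       "isEnrolled": [True, False, True]},
                      index=[111, 112, 113])
frame2 = pd.DataFrame({"score": [2.17, 1.21, 4.12],
                       "isEnrolled": [True, False, False]},
                      index=[111, 112, 113])

# True wherever a cell differs between the two frames.
changed_mask = frame1 != frame2

# Ids of the rows that contain at least one changed cell.
changed_ids = frame1.index[changed_mask.any(axis=1)]
print(list(changed_ids))
```

Note that NaN cells compare as unequal to each other, so rows with missing values on both sides would also be flagged by this approach.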


jur*_*jur 6

Another approach, using concat and drop_duplicates:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)

# Raw strings avoid the invalid-escape warning for the regex separator
df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
#%%
dictionary = {1:df1,2:df2}
df=pd.concat(dictionary)
df.drop_duplicates(keep=False)

Output:

       Name  score isEnrolled      Comment
  id                                      
1 112  Nick   1.11      False    Graduated
  113   Zoe    NaN       True             
2 112  Nick   1.21      False    Graduated
  113   Zoe    NaN      False  On vacation


Aar*_*ock 6

After fiddling with @journois's answer, I was able to get it working using a MultiIndex instead of a Panel, due to Panel's deprecation.

First, create some dummy data:

df1 = pd.DataFrame({
    'id': ['111', '222', '333', '444', '555'],
    'let': ['a', 'b', 'c', 'd', 'e'],
    'num': ['1', '2', '3', '4', '5']
})
df2 = pd.DataFrame({
    'id': ['111', '222', '333', '444', '666'],
    'let': ['a', 'b', 'c', 'D', 'f'],
    'num': ['1', '2', 'Three', '4', '6'],
})

Then define your diff function; in this case I'll use the report_diff function from his answer unchanged:

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

Then I'll concatenate the data into a MultiIndex dataframe:

df_all = pd.concat(
    [df1.set_index('id'), df2.set_index('id')], 
    axis='columns', 
    keys=['df1', 'df2'],
    join='outer'
)
df_final = df_all.swaplevel(axis='columns')[df1.columns[1:]]

Finally, I'll apply report_diff to each column group:

df_final.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))

This outputs:

         let        num
111        a          1
222        b          2
333        c  3 | Three
444    d | D          4
555  e | nan    5 | nan
666  nan | f    nan | 6

And that's all!
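On recent pandas versions, grouping along columns with `groupby(..., axis=1)` is, to my knowledge, deprecated, so the same table can also be built without it. The sketch below is one possible variant under that assumption; `report_pair` and `value_columns` are hypothetical names introduced here, and the data is the dummy data from this answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": ["111", "222", "333", "444", "555"],
    "let": ["a", "b", "c", "d", "e"],
    "num": ["1", "2", "3", "4", "5"],
})
df2 = pd.DataFrame({
    "id": ["111", "222", "333", "444", "666"],
    "let": ["a", "b", "c", "D", "f"],
    "num": ["1", "2", "Three", "4", "6"],
})

def report_pair(old, new):
    """Return the shared value, or 'old | new' when the two differ."""
    return old if old == new else f"{old} | {new}"

# Outer join on id so rows present in only one frame are kept (as NaN).
merged = pd.concat(
    [df1.set_index("id"), df2.set_index("id")],
    axis="columns", keys=["df1", "df2"], join="outer",
)

value_columns = [c for c in df1.columns if c != "id"]
result = pd.DataFrame(
    {col: [report_pair(old, new)
           for old, new in zip(merged[("df1", col)], merged[("df2", col)])]
     for col in value_columns},
    index=merged.index,
)
print(result)
```

Pairing the columns explicitly with `zip` avoids positional indexing into a row Series, at the cost of a Python-level loop over every cell.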