Pandas Merging 101

cs9*_*s95 271 python merge join pandas

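  • How can I perform a (LEFT|RIGHT|FULL) (INNER|OUTER) join with pandas?
  • How do I add NaNs for missing rows after a merge?
  • How do I get rid of NaNs after merging?
  • Can I merge on the index?
  • How do I merge multiple DataFrames?
  • merge? join? concat? update? Who? What? Why?!

... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on later).

Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.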

cs9*_*s95 375

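This post aims to give readers a primer on SQL-flavored merging with pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:

  • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)

    • merging with different column names
    • avoiding duplicate merge key columns in the output
  • Merging with the index under different conditions
    • effectively using your named index
    • merge key as the index of one and the column of another
  • Multiway merges on columns and indexes (unique and non-unique)
  • Notable alternatives to merge and join

What this post will not go through:

  • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Lastly, all visual representations of JOIN operations have been borrowed with thanks from the article https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins.

Enough talk, just show me how to use merge!

Setup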

import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})    
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

left

  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
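3   F -0.151357

For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by

[Image: Venn diagram of an INNER JOIN]

Note
Here, the left-hand set refers to the join-column keys from the left DataFrame, the right-hand set refers to the join-column keys from the right DataFrame, and the intersection represents keys common to both left and right. The shaded region represents the keys that are present in the JOIN result. This convention will be followed throughout. Keep in mind that Venn diagrams are not a 100% accurate representation of JOIN operations, so take them with a pinch of salt.

To perform an INNER JOIN, call pd.merge, specifying the left DataFrame, the right DataFrame, and the join key.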

pd.merge(left, right, on='key')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

This returns only rows from left and right which share a common key (in this example, "B" and "D").

In more recent versions of pandas (v0.21 or so), merge is now a first-class DataFrame method, so you can call DataFrame.merge.

left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

A LEFT OUTER JOIN, or LEFT JOIN, is represented by

[Image: Venn diagram of a LEFT OUTER JOIN]

This can be performed by specifying how='left'.

left.merge(right, on='key', how='left')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
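3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN, which is...

[Image: Venn diagram of a RIGHT OUTER JOIN]

...specify how='right':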

left.merge(right, on='key', how='right')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357

Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, for the FULL OUTER JOIN, given by

[Image: Venn diagram of a FULL OUTER JOIN]

specify how='outer'.

left.merge(right, on='key', how='outer')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarises these various merges nicely:

[Image: table from the pandas documentation summarising the merge types]

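As a quick sanity check on the four join types above, the sketch below rebuilds the left and right frames and compares the keys kept by each how value against plain Python set operations. The names expected and result_keys are illustrative only and are not part of pandas.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

left_keys, right_keys = set(left['key']), set(right['key'])

# Keys we expect each join type to keep, written as set operations.
expected = {
    'inner': left_keys & right_keys,
    'left': left_keys,
    'right': right_keys,
    'outer': left_keys | right_keys,
}

# Run each merge and record which keys actually survive.
result_keys = {}
for how in expected:
    merged = left.merge(right, on='key', how=how)
    result_keys[how] = set(merged['key'])
    print(how, sorted(result_keys[how]))
```

This only holds so neatly because the keys are unique in both frames; with duplicated keys the row counts multiply even though the key sets stay the same.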
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs, you can do them in two steps.

For LEFT-Excluding JOIN, represented as

[Image: Venn diagram of a LEFT-Excluding JOIN]

Start by performing a LEFT OUTER JOIN and then filtering to keep only the rows coming from left (excluding everything that matched right),

(left.merge(right, on='key', how='left', indicator=True)
     .query('_merge == "left_only"')
     .drop(columns='_merge'))

  key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN

Where,

left.merge(right, on='key', how='left', indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

And similarly, for a RIGHT-Excluding JOIN,

[Image: Venn diagram of a RIGHT-Excluding JOIN]

(left.merge(right, on='key', how='right', indicator=True)
     .query('_merge == "right_only"')
     .drop(columns='_merge'))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (in other words, performing an ANTI-JOIN),

[Image: Venn diagram of an ANTI-JOIN]

You can do this in similar fashion:

(left.merge(right, on='key', how='outer', indicator=True)
     .query('_merge != "both"')
     .drop(columns='_merge'))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357

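The three excluding joins above differ only in the how value and the filter on _merge, so they can be folded into one function. The following is a minimal sketch; excluding_join and its side parameter are hypothetical names introduced here for illustration, and the small integer-valued frames are stand-ins for left and right.

```python
import pandas as pd

def excluding_join(left, right, on, side='left'):
    """Hypothetical helper: keep rows whose key appears only on `side`.

    side='left'  -> LEFT-Excluding JOIN
    side='right' -> RIGHT-Excluding JOIN
    side='both'  -> rows found in exactly one frame (the ANTI-JOIN above)
    """
    how = {'left': 'left', 'right': 'right', 'both': 'outer'}[side]
    merged = left.merge(right, on=on, how=how, indicator=True)
    if side == 'both':
        mask = merged['_merge'] != 'both'
    else:
        mask = merged['_merge'] == f'{side}_only'
    # Drop the helper column by name once the filter has been applied.
    return merged[mask].drop(columns='_merge')

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5.0, 6.0, 7.0, 8.0]})

left_only = excluding_join(left, right, on='key', side='left')
right_only = excluding_join(left, right, on='key', side='right')
either_only = excluding_join(left, right, on='key', side='both')
```

Boolean masking is used here instead of query purely to keep the filter expression easy to build from the side argument; the result should match the chained versions above.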
Different names for key columns

If the key columns are named differently, for example, left has keyLeft and right has keyRight instead of key, then you will have to specify left_on and right_on as arguments instead of on:

left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)

left2

  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357

left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278

Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), and you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.

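To make the "which frame's index is set" point concrete, here is a small sketch that runs the trick from both sides. The frames are simplified stand-ins with fixed values, and left2_indexed and right2_indexed are illustrative names, not from the original post.

```python
import pandas as pd

left2 = pd.DataFrame({'keyLeft': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right2 = pd.DataFrame({'keyRight': ['B', 'D', 'E', 'F'], 'value': [5.0, 6.0, 7.0, 8.0]})

# Index set on the LEFT frame: its key is consumed, so `keyRight` survives.
left2_indexed = left2.set_index('keyLeft')
keep_right_key = left2_indexed.merge(right2, left_index=True, right_on='keyRight')

# Index set on the RIGHT frame: its key is consumed, so `keyLeft` survives.
right2_indexed = right2.set_index('keyRight')
keep_left_key = left2.merge(right2_indexed, left_on='keyLeft', right_index=True)

print(list(keep_right_key.columns))
print(list(keep_left_key.columns))
```

In other words, the key that was moved into the index is the one that disappears from the output columns.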
Merging only a single column from one of the DataFrames

For example, consider

right3 = right.assign(newcol=np.arange(len(right)))
right3
  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:

left.merge(right3[['key', 'newcol']], on='key')

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:

# left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

left.merge(right3[['key', 'newcol']], on='key', how='left')

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

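The map shortcut relies on the lookup Series having a unique index, which the post does not state explicitly; that requirement is an assumption spelled out in the sketch below. It rebuilds the frames, builds the lookup once, and compares the two approaches side by side. The names lookup, via_map and via_merge are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
right3 = right.assign(newcol=np.arange(len(right)))

# Lookup Series: the index holds the keys, the values are the column to bring over.
# Assumption: keys in `right3` are unique, otherwise the lookup is ambiguous.
lookup = right3.set_index('key')['newcol']
if not lookup.index.is_unique:
    raise ValueError("map-based lookup needs unique keys; use merge instead")

via_map = left.assign(newcol=left['key'].map(lookup))
via_merge = left.merge(right3[['key', 'newcol']], on='key', how='left')

# Compare the new column from both approaches row by row.
print(pd.DataFrame({'map': via_map['newcol'], 'merge': via_merge['newcol']}))
```

If the keys on the right are not unique, the LEFT OUTER JOIN via merge remains the safer choice, since it produces one output row per match.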
Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

left.merge(right, on=['key1', 'key2'] ...)

Or, in the event the names are different,

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])

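The two snippets above are schematic, so here is a runnable sketch of the same idea. The frames and the column names key1, key2, rkey1, rkey2, lvalue and rvalue are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical frames: `key1` and `key2` together identify a row.
left = pd.DataFrame({
    'key1': ['A', 'A', 'B', 'B'],
    'key2': [1, 2, 1, 2],
    'lvalue': [10, 20, 30, 40],
})
right = pd.DataFrame({
    'key1': ['A', 'B', 'B', 'C'],
    'key2': [1, 1, 2, 1],
    'rvalue': [100, 200, 300, 400],
})

# Rows match only when BOTH key columns agree.
both_keys = left.merge(right, on=['key1', 'key2'])

# The same join when the key columns are named differently on the right.
right_renamed = right.rename(columns={'key1': 'rkey1', 'key2': 'rkey2'})
diff_names = left.merge(right_renamed,
                        left_on=['key1', 'key2'],
                        right_on=['rkey1', 'rkey2'])
```

Note that the lists passed to left_on and right_on are paired positionally, so their order matters.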
Other useful merge operations and functions

  • Merging a DataFrame with Series on index: See this answer.
  • Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.

  • pd.merge_ordered is a useful function for ordered JOINs.

  • pd.merge_asof (read: merge_asOf) is useful for approximate joins.

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.

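For readers who have not met the ordered and approximate joins mentioned in the list above, the following is a small sketch of pd.merge_asof and pd.merge_ordered. The trades and quotes frames are invented for illustration and use integer timestamps to keep things simple.

```python
import pandas as pd

# Hypothetical data: trades and quotes keyed by an integer timestamp.
trades = pd.DataFrame({'time': [1, 5, 10], 'qty': [100, 200, 300]})
quotes = pd.DataFrame({'time': [0, 4, 6, 9], 'price': [9.9, 10.1, 10.2, 10.4]})

# merge_asof: for each trade, take the most recent quote at or before it.
# Both frames must already be sorted by the `on` column.
asof = pd.merge_asof(trades, quotes, on='time')

# merge_ordered: an ordered outer join; fill_method='ffill' carries the
# last seen value forward into the gaps.
ordered = pd.merge_ordered(trades, quotes, on='time', fill_method='ffill')

print(asof)
print(ordered)
```

merge_asof matches on the nearest earlier key by default rather than on equality, which is what makes it suited to time series that do not line up exactly.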

Index-based *-JOIN (+ index-column merges)

Setup

np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])    
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

left
           value
idxkey          
A      -0.602923
B      -0.402655
C       0.302329
D      -0.524349

right

           value
idxkey          
B       0.543843
D       0.013135
E      -0.326498
F       1.385076

Typically, a merge on index would look like this:

left.merge(right, left_index=True, right_index=True)


         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Support for index names

If your index is named, then v0.23 users can also specify the level name to on (or left_on and right_on as necessary).

left.merge(right, on='idxkey')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Merging on index of one, column(s) of another

It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,

left.merge(right, left_on='key1', right_index=True)

Or vice versa (right_on=... and left_index=True).

right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2

  colkey     value
0      B  0.543843
1      D  0.013135
2      E -0.326498
3      F  1.385076

left.merge(right2, left_index=True, right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

In this special case, the index for left is named, so you can also use the index name with left_on, like this:

left.merge(right2, left_on='idxkey', right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join, which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.

left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:

left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')

Since the column names are the same. This would not be a problem if they were differently named.

left.rename(columns={'value':'leftvalue'}).join(right, how='inner')

        leftvalue     value
idxkey                     
B       -0.402655  0.543843
D       -0.524349  0.013135

pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat:

pd.concat([left, right], axis=1, sort=False, join='inner')

           value     value
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Omit join='inner' if you need a FULL OUTER JOIN (the default):

pd.concat([left, right], axis=1, sort=False)

      value     value
A -0.602923       NaN
B -0.402655  0.543843
C  0.302329       NaN
D -0.524349  0.013135
E       NaN -0.326498
F       NaN  1.385076

For more information, see this canonical post on pd.concat by @piRSquared.


Generalizing: merging multiple DataFrames

Setup

# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})    
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C] 

# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')

dfs2 = [A2, B2, C2]

Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:

df1.merge(df2, ...).merge(df3, ...)

However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames. To do this, one often-used simple trick is functools.reduce: calling reduce(pd.merge, dfs) achieves an INNER JOIN.

Note that every column besides the "key" column should be differently named for this to work out of the box. Otherwise, you may need to wrap the merge call, for example in a lambda.

For a FULL OUTER JOIN, you can curry pd.merge using functools.partial, for example partial(pd.merge, how='outer'). As you may have noticed, this is quite powerful: you can also use it to control column names during the merge. Simply add more keyword arguments as needed.

The alternative: pd.concat
If your column values are unique, then it makes sense to use pd.concat, which is faster than a two-at-a-time multiway merge.

# merge on `key` column, you'll need to set the index before concatenating
pd.concat([
    df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

Multiway merge on unique indexes

If you are merging multiple DataFrames on unique indexes, you should once again prefer pd.concat for better performance.

# merge on `key` index
pd.concat(dfs2, axis=1, sort=False, join='inner')

       valueA    valueB  valueC
key                            
D    2.240893 -0.977278     1.0

As always, omit join='inner' for a FULL OUTER JOIN.

Multiway merge on indexes with duplicates

concat is fast, but has its shortcomings. It cannot handle duplicates.

A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})

pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)

In this situation, join is the best option, since it can handle non-unique indexes (join calls merge under the hood).

# join on `key` column, set as the index first
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join(
    [df.set_index('key') for df in (B, C)], how='inner').reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# join on `key` index
A3.set_index('key').join([B2, C2], how='inner')

       valueA    valueB  valueC
key                            
D    1.454274 -0.977278     1.0
D    0.761038 -0.977278     1.0

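  • This is a great resource! The only question I still have is why call it merge instead of join, and join instead of merge? (7 upvotes)
  • If anyone is confused by the table of contents at the end of each post, I split up this massive answer into 4 separate ones, 3 on this question and 1 on another. The way it was set up previously made it harder to point people to specific topics. This allows you to bookmark separate topics easily now! (3 upvotes)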

Anu*_*dse 79

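Joins 101

These animations might explain it to you better visually. Credits: Garrick Aden-Buie tidyexplain repo

Inner Join

[Animation: inner join]

Outer Join or Full Join

[Animation: outer (full) join]

Right Join

[Animation: right join]

Left Join

[Animation: left join]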


eli*_*liu 33

A supplemental visual view of pd.concat([df0, df1], kwargs). Notice that the meaning of the kwarg axis=0 or axis=1 is not as intuitive as in df.mean() or df.apply(func).

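Since the picture itself is not reproduced here, the following sketch shows the same idea in code: what axis and join do in pd.concat. The frames df0 and df1 are tiny invented examples with partially overlapping labels.

```python
import pandas as pd

df0 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])
df1 = pd.DataFrame({'b': [5, 6], 'c': [7, 8]}, index=['y', 'z'])

# axis=0 (the default): stack rows; columns are aligned by name.
stacked = pd.concat([df0, df1], axis=0)

# axis=1: place frames side by side; rows are aligned by index label.
side_by_side = pd.concat([df0, df1], axis=1)

# join='inner' keeps only the labels shared on the *other* axis.
stacked_inner = pd.concat([df0, df1], axis=0, join='inner')
side_inner = pd.concat([df0, df1], axis=1, join='inner')

print(stacked.shape, side_by_side.shape)
```

One way to remember it: axis names the direction along which the frames are glued together, and join decides how the labels on the remaining axis are reconciled.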

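[Image: diagram of pd.concat([df0, df1])]

  • This is a nice diagram. May I ask how you produced it? (8 upvotes)
  • Google Docs' built-in "Insert ==> Drawing... ==> New" (as of May 2019). But, to be clear: the only reason I used Google Docs for this picture is that my notes are stored in Google Docs, and I wanted a picture that could be modified quickly within Google Docs itself. Actually, now that you mention it, Google Docs' drawing tool is pretty neat. (6 upvotes)
  • Yes, so now there are "merge", "concat", axes and whatever. However, as @eliu shows, it is all the same concept of *merging* with "left" and "right" and "horizontal" or "vertical". Personally, I have to look at the documentation every time I need to remember which "axis" is "0" and which is "1". (4 upvotes)
  • Wow, this is great. Coming from the SQL world, a "vertical" join is not a join in my head, as the table's structure is always fixed. Now I even think pandas should consolidate "concat" and "merge" with a direction parameter being "horizontal" or "vertical". (2 upvotes)
  • @Ufos Isn't that exactly what `axis = 1` and `axis = 0` are? (2 upvotes)
  • If possible, someone should address `axis=0` and `axis=1` across `.mean() .apply() .dropna() .concat()`. I have to think hard to decide for each case. (2 upvotes)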

Gon*_*ica 18

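In this answer, I will consider practical examples.

The first one is of pandas.concat.

The second one is of merging dataframes from the index of one and the column of another one.


1. pandas.concat

Consider the following DataFrames with the same column names:

Preco2018 with size (8784, 5)

[Image: DataFrame 1]

Preco2019 with size (8760, 5)

[Image: DataFrame 2]

These have the same column names.

You can combine them using pandas.concat, by simply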

import pandas as pd

frames = [Preco2018, Preco2019]

df_merged = pd.concat(frames)

This results in a DataFrame with the following size (17544, 5)

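Because the original frames are only shown as images, here is a sketch with stand-in data of the same shapes. The column names c1 to c5 and the zero and one fill values are placeholders, not the real Preco2018 and Preco2019 contents.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for Preco2018 / Preco2019: same columns, different lengths.
cols = ['c1', 'c2', 'c3', 'c4', 'c5']
preco_2018 = pd.DataFrame(np.zeros((8784, 5)), columns=cols)
preco_2019 = pd.DataFrame(np.ones((8760, 5)), columns=cols)

# Row-wise concatenation: the row counts add up, the columns stay the same.
df_merged = pd.concat([preco_2018, preco_2019])
print(df_merged.shape)

# Without ignore_index=True the original row labels are kept,
# so labels 0..8759 appear twice in the result.
df_renumbered = pd.concat([preco_2018, preco_2019], ignore_index=True)
print(df_merged.index.is_unique, df_renumbered.index.is_unique)
```

If the real frames are indexed by timestamp rather than by position, the labels would not collide and ignore_index would not be needed.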
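[Image: DataFrame resulting from the combination of the two DataFrames]

If you want to visualize it, it ends up working like this

[Image: how concat works]

(Source)


2. Merge by column and index

In this part, I will consider a specific case: merging the index of one dataframe with the column of another dataframe.

Let's say one has the dataframe Geo with 54 columns, one of which is the date column Data, of type datetime64[ns].

[Image: the Geo dataframe]

And the dataframe Price, which has one column with the price and whose index corresponds to the dates.

[Image: the Price dataframe]

In this specific case, to merge them, one uses pd.merge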

merged = pd.merge(Price, Geo, left_index=True, right_on='Data')

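A miniature version of this merge can be reproduced as follows. The three-row Price and Geo frames, the region column and all values are invented for illustration; only the names Price, Geo and Data come from the answer.

```python
import pandas as pd

# Hypothetical miniature versions of `Price` and `Geo`.
dates = pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03'])
Price = pd.DataFrame({'price': [50.0, 52.5, 49.0]}, index=dates)
Geo = pd.DataFrame({
    'Data': pd.to_datetime(['2019-01-02', '2019-01-03', '2019-01-04']),
    'region': ['north', 'south', 'north'],
})

# Match the DatetimeIndex of `Price` against the `Data` column of `Geo`.
# The default is an INNER JOIN, so only dates present in both survive.
merged = pd.merge(Price, Geo, left_index=True, right_on='Data')
print(merged)
```

Both sides need a compatible datetime dtype for the match to work; if Data were stored as strings it would have to be converted with pd.to_datetime first.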
This results in the following dataframe

[Image: the merged dataframe]


cs9*_*s95 16

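This post will go through the following topics:

  • how to correctly generalize to multiple DataFrames (and why merge has shortcomings here)
  • merging on unique keys
  • merging on non-unique keys

BACK TO TOP



Generalizing to multiple DataFrames

Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls: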

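df1.merge(df2, ...).merge(df3, ...)

However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.

Here I introduce pd.concat for multiway joins on unique keys, and DataFrame.join for multiway joins on non-unique keys. First, the setup.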

# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})    
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C] 

# Note: the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')

dfs2 = [A2, B2, C2]

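Before moving on to concat and join, it may help to see the chained-merge approach generalised to a list of unknown length. One way to sketch this is with functools.reduce and functools.partial from the standard library; this is an illustration of the chaining idea, not a replacement for the techniques that follow, and outer_merge is an illustrative name.

```python
from functools import partial, reduce

import numpy as np
import pandas as pd

np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]

# Equivalent to A.merge(B).merge(C): an INNER JOIN on the shared `key` column.
inner = reduce(pd.merge, dfs)

# Fix keyword arguments up front with partial to get a FULL OUTER JOIN.
outer_merge = partial(pd.merge, on='key', how='outer')
outer = reduce(outer_merge, dfs)

print(inner)
print(outer)
```

This still performs one pairwise merge per DataFrame, so it generalises the syntax without changing the amount of work done.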
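Multiway merge on unique keys

If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.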

# Merge on `key` column. You'll need to set the index before concatenating
pd.concat(
    [df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# Merge on `key` index.
pd.concat(dfs2, axis=1, sort=False, join='inner')

       valueA    valueB  valueC
key                            
D    2.240893 -0.977278     1.0

Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).


Multiway merge on keys with duplicates

concat is fast, but has its shortcomings. It cannot handle duplicates.

A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)

In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).

# Join on `key` column. Set as the index first.
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join([B2, C2], how='inner').reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# Join on `key` index.
A3.set_index('key').join([B2, C2], how='inner')

       valueA    valueB  valueC
key                            
D    1.454274 -0.977278     1.0
D    0.761038 -0.977278     1.0

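The choice between concat and join can also be made at runtime by checking whether every key index is unique. The helper below is a minimal sketch of that idea; multiway_inner is a hypothetical name, and it assumes all non-key columns are named differently across the frames.

```python
import numpy as np
import pandas as pd

def multiway_inner(frames, key):
    """Hypothetical helper: INNER JOIN any number of frames on column `key`."""
    indexed = [df.set_index(key) for df in frames]
    if all(df.index.is_unique for df in indexed):
        # Unique keys everywhere: a single concat is enough.
        return pd.concat(indexed, axis=1, join='inner')
    # Duplicated keys somewhere: fall back to join, which tolerates them.
    first, *rest = indexed
    return first.join(rest, how='inner')

np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})

unique_result = multiway_inner([A, B, C], 'key')
dup_result = multiway_inner([A3, B, C], 'key')
print(unique_result)
print(dup_result)
```

The uniqueness check costs one pass over each index, which is usually small compared with the join itself.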
Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

* You are here


cs9*_*s95 8

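This post will go through the following topics:

  • Merging with the index under different conditions
    • options for index-based joins: merge, join, concat
    • merging on indexes
    • merging on the index of one and a column of the other
  • effectively using named indexes to simplify merging syntax

BACK TO TOP



Index-based joins

TL;DR

There are a few options, some simpler than others depending on the use case.

  1. DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
    • supports inner/left/right/full
    • can only join two at a time
    • supports column-column, index-column, index-index joins
  2. DataFrame.join (join on index)
    • supports inner/left (default)/right/full
    • can join multiple DataFrames at a time
    • supports index-index joins
  3. pd.concat (joins on index)
    • supports inner/full (default)
    • can join multiple DataFrames at a time
    • supports index-index joins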

Index to index joins

Setup and basics

import pandas as pd
import numpy as np

np.random.seed([3, 14])
left = pd.DataFrame(data={'value': np.random.randn(4)}, 
                    index=['A', 'B', 'C', 'D'])    
right = pd.DataFrame(data={'value': np.random.randn(4)},  
                     index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

left
           value
idxkey          
A      -0.602923
B      -0.402655
C       0.302329
D      -0.524349

right
 
           value
idxkey          
B       0.543843
D       0.013135
E      -0.326498
F       1.385076

Typically, an inner join on index would look like this:

left.merge(right, left_index=True, right_index=True)

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Other join types follow similar syntax.

Notable alternatives

  1. DataFrame.join defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.

     left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
    
              value_x   value_y
     idxkey                    
     B      -0.402655  0.543843
     D      -0.524349  0.013135
    
    Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:

     left.join(right)
     ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
    
    Since the column names are the same. This would not be a problem if they were differently named.

     left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
    
             leftvalue     value
     idxkey                     
     B       -0.402655  0.543843
     D       -0.524349  0.013135
    
  2. pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so join='inner' is required here.

     pd.concat([left, right], axis=1, sort=False, join='inner')
    
                value     value
     idxkey                    
     B      -0.402655  0.543843
     D      -0.524349  0.013135
    
    For more information on concat, see this post.


Index to column join

To perform an inner join using the index of left and a column of right, you will use DataFrame.merge with a combination of left_index=True and right_on=....

right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
 
  colkey     value
0      B  0.543843
1      D  0.013135
2      E -0.326498
3      F  1.385076

left.merge(right2, left_index=True, right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
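1 -0.524349      D  0.013135

Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.

join and concat are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index.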


Effectively using named indexes [pandas >= 0.23]

If your index is named, then from pandas >= 0.23, DataFrame.merge allows you to specify the index name to on (or left_on and right_on as necessary).

left.merge(right, on='idxkey')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

For the previous example of merging with the index of left and a column of right, you can use left_on with the index name of left:

left.merge(right2, left_on='idxkey', right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

* You are here