Combine two columns into one in the same dataframe in pandas/python

Question

BCK*_*CKN 7 python merge append dataframe pandas

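I am having trouble combining two columns into one within the same dataframe (start_end) while also removing the null values. I want to merge 'Start station' and 'End station' into a single 'station' column and keep 'Duration' aligned with the new 'station' column. I have tried pd.merge, pd.concat, and pd.append, but I could not get it to work.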

The start_end dataframe:

    Duration  End station      Start station
14  1407      NaN              14th & V St NW
19  509       NaN              21st & I St NW
20  638       15th & P St NW.  NaN
27  1532      NaN              Massachusetts Ave & Dupont Circle NW
28  759       NaN              Adams Mill & Columbia Rd NW

Expected output:

    Duration    stations
14  1407        14th & V St NW
19  509         21st & I St NW
20  638         15th & P St NW
27  1532        Massachusetts Ave & Dupont Circle NW
28  759         Adams Mill & Columbia Rd NW

My code so far:

#start_end is the dataframe, 'start station', 'end station', 'duration'
start_end = pd.concat([df_start, df_end])

This is what I tried:

station = pd.merge([start_end['Start station'],start_end['End station']])

Answer 1

piR*_*red 9

`fillna`

If the NaN values really are nulls

df.assign(**{
    'Start station': df['Start station'].fillna(df['End station'])})

    Duration      End station                         Start station
14      1407              NaN                        14th & V St NW
19       509              NaN                        21st & I St NW
20       638  15th & P St NW.                       15th & P St NW.
27      1532              NaN  Massachusetts Ave & Dupont Circle NW
28       759              NaN           Adams Mill & Columbia Rd NW
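
To get all the way to the expected output in the question, the same `fillna` idea can be followed by dropping the two source columns. This is only a sketch: the sample frame is rebuilt by hand from the question, and the new column name `station` and the variable `result` are my own choices.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumes the gaps are real NaN values).
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    },
    index=[14, 19, 20, 27, 28],
)

# Fill the gaps in 'Start station' from 'End station', store the result
# under a new name, then drop the two source columns.
result = df.assign(
    station=df["Start station"].fillna(df["End station"])
).drop(columns=["Start station", "End station"])

print(result)
```

`fillna` aligns on the index, so each 'Duration' stays on the same row as its station.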

`mask`

If the NaN values are actually the string 'NaN'

df.assign(**{
    'Start station': df['Start station'].mask(
        lambda x: x == 'NaN', df['End station'])})

    Duration      End station                         Start station
14      1407              NaN                        14th & V St NW
19       509              NaN                        21st & I St NW
20       638  15th & P St NW.                       15th & P St NW.
27      1532              NaN  Massachusetts Ave & Dupont Circle NW
28       759              NaN           Adams Mill & Columbia Rd NW
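
For the string case, here is a small self-contained sketch with a boolean condition instead of a lambda. The two-row frame is made up for illustration and assumes the missing entries are the literal text 'NaN'; the names `station` and `result` are mine.

```python
import pandas as pd

# Hypothetical two-row sample where missing entries are the text 'NaN'.
df = pd.DataFrame(
    {
        "Duration": [1407, 638],
        "End station": ["NaN", "15th & P St NW."],
        "Start station": ["14th & V St NW", "NaN"],
    }
)

start = df["Start station"]

# Wherever 'Start station' holds the text 'NaN', take 'End station' instead.
station = start.mask(start == "NaN", df["End station"])

result = df.assign(station=station).drop(
    columns=["Start station", "End station"]
)
print(result)
```

Note that `fillna` would do nothing here, because the string 'NaN' is not a missing value as far as pandas is concerned.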

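@BCKN, I apologize, but I am taken aback by this comment. This post contains all the information you need to accomplish your task. I may not have presented it on what looks like a silver platter, but that is because I expected you to be able to glean what you need from my answer. What bothers me is that you seem to expect me to serve this on the platter of your choosing. It may help to be reminded that everyone who posts answers on SO is a volunteer and could very well spend their time doing something other than helping you. If I have misunderstood you, please let me know. (2 upvotes)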

Answer 2

sjd*_*sjd 6

Use combine_first. It replaces the null values in col1 with the values from col2.

df["station"] = df["End station"].combine_first(df["Start station"])
# Pass the labels by keyword: the positional axis argument was removed in pandas 2.0.
df.drop(columns=["End station", "Start station"], inplace=True)
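
One thing worth knowing about `combine_first` is which side wins when both columns have a value. A tiny sketch with made-up series (`a` and `b` are hypothetical, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical series: row 0 has a value on both sides, row 1 only in b.
a = pd.Series(["A", np.nan])
b = pd.Series(["B", "C"])

# The calling series takes priority; the argument only fills its gaps.
a_first = a.combine_first(b)
b_first = b.combine_first(a)

print(a_first.tolist())
print(b_first.tolist())
```

In the question's data each row has exactly one station, so the order of the two columns should not matter there.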

Answer 3

wwi*_*wii 5

>>> df
   Duration      End station                         Start station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW

Give both columns the same name

>>> df.columns = df.columns.str.replace('.*?station', 'station', regex=True)
>>> df
   Duration          station                               station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW

Stack, then unstack.

>>> s = df.stack()
>>> s
0  Duration                                    1407
   station                           14th & V St NW
1  Duration                                     509
   station                           21st & I St NW
2  Duration                                     638
   station                          15th & P St NW.
3  Duration                                    1532
   station     Massachusetts Ave & Dupont Circle NW
4  Duration                                     759
   station              Adams Mill & Columbia Rd NW
dtype: object
>>> df = s.unstack()
>>> df
  Duration                               station
0     1407                        14th & V St NW
1      509                        21st & I St NW
2      638                       15th & P St NW.
3     1532  Massachusetts Ave & Dupont Circle NW
4      759           Adams Mill & Columbia Rd NW
>>>

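I think this is how it works:

.stack creates a Series with a MultiIndex and takes care of the null values for you. It aligns the second level on the column names, and because the two column names are identical there is only one of them, so unstacking produces a single column.

That is really just a guess, based on how the index differs when you do not change the column names.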

>>> # without changing column names
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'End station', 'Start station']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 2, 0, 2, 0, 1, 0, 2, 0, 2]])

>>> # column names the same
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'station']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]])

It seems a bit tricky; maybe someone will comment on it.
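
If relying on duplicate column names feels too fragile, a variation on the same stacking idea is to stack only the two station columns, drop the nulls explicitly, and discard the column-name level. This is a sketch under assumptions: the frame is rebuilt from the question, the names `stacked`, `station` and `result` are mine, and it assumes every row has exactly one non-null station. Calling `.dropna()` explicitly avoids depending on whether `.stack` drops nulls by itself, which I believe differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    }
)

# Stack only the station columns: one entry per (row label, column name).
stacked = df[["End station", "Start station"]].stack().dropna()

# Drop the column-name level so only the row label is left as the index.
station = stacked.droplevel(1).rename("station")

# Join back to 'Duration' on the row labels.
result = df[["Duration"]].join(station)
print(result)
```

If some row had both a start and an end station, `station` would contain that row label twice and the join would duplicate the row.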

Alternative: use pd.concat and .dropna

>>> stations = pd.concat([df.iloc[:,1],df.iloc[:,2]]).dropna()
>>> stations.name = 'stations'
>>> stations
2                         15th & P St NW.
0                          14th & V St NW
1                          21st & I St NW
3    Massachusetts Ave & Dupont Circle NW
4             Adams Mill & Columbia Rd NW
Name: stations, dtype: object

>>> df2 = pd.concat([df['Duration'], stations], axis=1)
>>> df2
   Duration                              stations
0      1407                        14th & V St NW
1       509                        21st & I St NW
2       638                       15th & P St NW.
3      1532  Massachusetts Ave & Dupont Circle NW
4       759           Adams Mill & Columbia Rd NW
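
The same approach can be written with column labels instead of positions, which may be easier to read. A sketch, assuming the original non-default index from the question (14, 19, ...) and that the index labels are unique; the shortened three-row frame is mine.

```python
import numpy as np
import pandas as pd

# Shortened sample using the row labels from the question.
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638],
        "End station": [np.nan, np.nan, "15th & P St NW."],
        "Start station": ["14th & V St NW", "21st & I St NW", np.nan],
    },
    index=[14, 19, 20],
)

# Put the two columns end to end, then discard the nulls.
stations = (
    pd.concat([df["End station"], df["Start station"]])
    .dropna()
    .rename("stations")
)

# Concatenating along axis=1 lines the values up by row label.
df2 = pd.concat([df["Duration"], stations], axis=1)
print(df2)
```

The alignment on row labels is what keeps each 'Duration' next to the right station, even though `stations` comes out in a different order.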
