BCK*_*CKN 7 python merge append dataframe pandas
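I'm having trouble merging two columns into one within the same DataFrame (start_end) and removing the null values. I want to merge 'Start station' and 'End station' into a single 'station' column, and keep 'Duration' lined up with the new 'station' column. I've tried pd.merge, pd.concat, and pd.append, but I can't get it to work.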
The start_end DataFrame:
    Duration      End station                         Start station
14      1407              NaN                        14th & V St NW
19       509              NaN                        21st & I St NW
20       638  15th & P St NW.                                   NaN
27      1532              NaN  Massachusetts Ave & Dupont Circle NW
28       759              NaN           Adams Mill & Columbia Rd NW
Expected output:
    Duration                              stations
14      1407                        14th & V St NW
19       509                        21st & I St NW
20       638                        14th & P St NW
27      1532  Massachusetts Ave & Dupont Circle NW
28       759           Adams Mill & Columbia Rd NW
My code so far:
#start_end is the dataframe, 'start station', 'end station', 'duration'
start_end = pd.concat([df_start, df_end])
This is what I tried:
station = pd.merge([start_end['Start station'],start_end['End station']])
Use fillna if the NaN values really are nulls:
df.assign(**{
    'Start station': df['Start station'].fillna(df['End station'])})
    Duration      End station                         Start station
14      1407              NaN                        14th & V St NW
19       509              NaN                        21st & I St NW
20       638  15th & P St NW.                       15th & P St NW.
27      1532              NaN  Massachusetts Ave & Dupont Circle NW
28       759              NaN           Adams Mill & Columbia Rd NW
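For reference, here is a self-contained sketch of the fillna approach that goes one step further and produces the layout the question asks for. The frame is rebuilt from the sample rows in the question; the name result and the new column name stations are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; real NaN values mark the missing side.
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    },
    index=[14, 19, 20, 27, 28],
)

# Fill the gaps in "Start station" from "End station", then keep a single column.
result = (
    df.assign(stations=df["Start station"].fillna(df["End station"]))
      .drop(columns=["End station", "Start station"])
)
print(result)
```

Because assign returns a new frame, df itself is left unchanged; assign the result back if you want to replace it.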
Use mask if the NaN values are actually the string 'NaN':
df.assign(**{
    'Start station': df['Start station'].mask(
        lambda x: x == 'NaN', df['End station'])})
    Duration      End station                         Start station
14      1407              NaN                        14th & V St NW
19       509              NaN                        21st & I St NW
20       638  15th & P St NW.                       15th & P St NW.
27      1532              NaN  Massachusetts Ave & Dupont Circle NW
28       759              NaN           Adams Mill & Columbia Rd NW
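A minimal sketch of the string case, assuming the columns hold the literal text "NaN" rather than real nulls. The three-row frame is a shortened version of the question's data, and the names stations and result are made up for the example.

```python
import pandas as pd

# Here "NaN" is ordinary text, so fillna would find nothing to fill.
df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638],
        "End station": ["NaN", "NaN", "15th & P St NW."],
        "Start station": ["14th & V St NW", "21st & I St NW", "NaN"],
    }
)

# Where "Start station" equals the text "NaN", take the value from "End station".
stations = df["Start station"].mask(df["Start station"] == "NaN", df["End station"])

result = df[["Duration"]].assign(stations=stations)
print(result)
```

You can check which case applies with df["Start station"].isna().sum(): a count of zero on a column that prints NaN suggests the values are strings.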
Use combine_first, which fills the null values in the first column with the values from the second:
df["station"] = df["End station"].combine_first(df["Start station"])
df.drop(columns=["End station", "Start station"], inplace=True)
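The same idea as a runnable sketch, assuming the missing entries are real nulls. The frame is rebuilt from the question's sample rows, and result is a hypothetical name; this version builds a new frame instead of modifying df in place.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    }
)

# Take "End station" where it is present, otherwise fall back to "Start station".
station = df["End station"].combine_first(df["Start station"])

# Drop the two source columns and attach the combined one.
result = df.drop(columns=["End station", "Start station"]).assign(station=station)
print(result)
```

If a row had both stations filled in, the caller (here "End station") would win, so the order of the two columns matters in that case.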
>>> df
   Duration      End station                         Start station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW
Give both station columns the same name:
>>> df.columns = df.columns.str.replace('.*?station', 'station', regex=True)
>>> df
   Duration          station                               station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW
Stack, then unstack:
>>> s = df.stack()
>>> s
0  Duration                                    1407
   station                           14th & V St NW
1  Duration                                     509
   station                           21st & I St NW
2  Duration                                     638
   station                          15th & P St NW.
3  Duration                                    1532
   station     Massachusetts Ave & Dupont Circle NW
4  Duration                                     759
   station              Adams Mill & Columbia Rd NW
dtype: object
>>> df = s.unstack()
>>> df
   Duration                               station
0      1407                        14th & V St NW
1       509                        21st & I St NW
2       638                       15th & P St NW.
3      1532  Massachusetts Ave & Dupont Circle NW
4       759           Adams Mill & Columbia Rd NW
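I think this is how it works:
.stack creates a Series with a MultiIndex and handles the null values for you by dropping them. The second level of the index comes from the column names, and because the two station columns now share one name there is only a single 'station' label, so unstacking produces just one station column.
This is really only a guess, based on how the index differs when you do not rename the columns: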
>>> # without changing column names
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'End station', 'Start station']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 2, 0, 2, 0, 1, 0, 2, 0, 2]])
>>> # column names the same
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'station']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]])
It feels a bit hacky; maybe someone will comment on it.
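To see the null-dropping step in isolation, here is a small sketch that stacks only the two station columns without renaming them to a duplicate label. It is a variation on the trick above, not the same code. The explicit .dropna() is an addition: as far as I know, newer pandas versions can keep nulls in stack(), so relying on the implicit drop may not be safe everywhere. The names stacked, stations, and result are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    }
)

# Stack the two station columns: every value is now keyed by
# (row label, original column name) in a two-level MultiIndex.
stacked = df[["End station", "Start station"]].stack()

# Drop nulls explicitly; this changes nothing if stack() already dropped them.
stacked = stacked.dropna()

# One value per row is left, so the column-name level can be discarded
# and the Series lines up with the original rows again.
stations = stacked.droplevel(1).rename("station")

result = df[["Duration"]].join(stations)
print(result)
```

This only gives one value per row when exactly one of the two station columns is filled in for each row, which is the case in the sample data.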
Alternative: use pd.concat and .dropna
>>> stations = pd.concat([df.iloc[:,1],df.iloc[:,2]]).dropna()
>>> stations.name = 'stations'
>>> stations
2                         15th & P St NW.
0                          14th & V St NW
1                          21st & I St NW
3    Massachusetts Ave & Dupont Circle NW
4             Adams Mill & Columbia Rd NW
Name: stations, dtype: object
>>> df2 = pd.concat([df['Duration'], stations], axis=1)
>>> df2
   Duration                              stations
0      1407                        14th & V St NW
1       509                        21st & I St NW
2       638                       15th & P St NW.
3      1532  Massachusetts Ave & Dupont Circle NW
4       759           Adams Mill & Columbia Rd NW
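The concat approach as a self-contained sketch, using column names instead of positions. The frame is rebuilt from the sample rows, and result is a hypothetical name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Duration": [1407, 509, 638, 1532, 759],
        "End station": [np.nan, np.nan, "15th & P St NW.", np.nan, np.nan],
        "Start station": [
            "14th & V St NW",
            "21st & I St NW",
            np.nan,
            "Massachusetts Ave & Dupont Circle NW",
            "Adams Mill & Columbia Rd NW",
        ],
    }
)

# Put the two station columns end to end, then discard the nulls.
# Each surviving value keeps the row label it came from.
stations = (
    pd.concat([df["End station"], df["Start station"]])
      .dropna()
      .rename("stations")
)

# axis=1 aligns on the index, so each station lands next to its own Duration.
result = pd.concat([df["Duration"], stations], axis=1)
print(result)
```

The alignment in the last step depends on the row labels staying unique after dropna, so this assumes each row has exactly one non-null station.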