So I have some CSV files I'm trying to work with, but some of them have multiple columns with the same name.
For example, I could have a CSV like this:
ID Name a a a b b
1 test1 1 NaN NaN "a" NaN
2 test2 NaN 2 NaN "a" NaN
3 test3 2 3 NaN NaN "b"
4 test4 NaN NaN 4 NaN "b"
Loading it into pandas gives me this:
ID Name a a.1 a.2 b b.1
1 test1 1 NaN NaN "a" NaN
2 test2 NaN 2 NaN "a" NaN
3 test3 2 3 NaN NaN "b"
4 test4 NaN NaN 4 NaN "b"
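What I want to do is merge these same-named columns into one column (keeping the values separate if there is more than one). My ideal output would be this: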
ID Name a b
1 test1 "1" "a"
2 test2 "2" "a"
3 test3 "2;3" "b"
4 test4 "4" "b"
So I'm wondering whether this is possible?
You can use groupby on axis=1 and experiment with something like:
>>> def sjoin(x): return ';'.join(x[x.notnull()].astype(str))
>>> df.groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1))
ID Name a b
0 1 test1 1.0 a
1 2 test2 2.0 a
2 3 test3 2.0;3.0 b
3 4 test4 4.0 b
Instead of .astype(str), you can use whatever formatting operator you want.
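For instance, here is a minimal sketch of the same idea with a custom formatter, written against the `a`, `a.1`, `a.2` names that pandas produced in the question. It collects the columns in a plain dictionary rather than with `groupby(..., axis=1)`, which recent pandas releases deprecate. The helper names `base_name` and `fmt` are made up for this example, and stripping a trailing `.N` assumes that no genuine column name ends that way.

```python
import re

import numpy as np
import pandas as pd

# The frame as pandas shows it after loading (duplicate names renamed to a.1, a.2, b.1).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["test1", "test2", "test3", "test4"],
    "a": [1.0, np.nan, 2.0, np.nan],
    "a.1": [np.nan, 2.0, 3.0, np.nan],
    "a.2": [np.nan, np.nan, np.nan, 4.0],
    "b": ["a", "a", np.nan, np.nan],
    "b.1": [np.nan, np.nan, "b", "b"],
})


def base_name(col):
    # Hypothetical helper: drop the ".N" suffix added for duplicate names.
    return re.sub(r"\.\d+$", "", col)


def fmt(value):
    # Hypothetical formatter: show whole floats as integers, everything else via str().
    if isinstance(value, float) and float(value).is_integer():
        return str(int(value))
    return str(value)


# Map each base name to the list of columns that share it, in original order.
groups = {}
for col in df.columns:
    groups.setdefault(base_name(col), []).append(col)

merged = pd.DataFrame(index=df.index)
for name, cols in groups.items():
    if len(cols) == 1:
        # Nothing to merge: keep the column (and its dtype) as is.
        merged[name] = df[cols[0]]
    else:
        # Join the non-null values of each row with ";".
        merged[name] = df[cols].apply(
            lambda row: ";".join(fmt(v) for v in row if pd.notna(v)), axis=1
        )

print(merged)
```

Swapping out `fmt` is the only change needed to get a different output format.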
Duplicate column names are probably not a good idea, but it will work:
In [72]:
df2=df[['ID', 'Name']].copy() #copy, so the assignments below do not write into a slice of df
df2['a']='"'+df.T[df.columns.values=='a'].apply(lambda x: ';'.join(["%i"%item for item in x[x.notnull()]]))+'"' #these columns are of float dtype
df2['b']=df.T[df.columns.values=='b'].apply(lambda x: ';'.join([item for item in x[x.notnull()]])) #these columns are of objects dtype
print(df2)
ID Name a b
0 1 test1 "1" "a"
1 2 test2 "2" "a"
2 3 test3 "2;3" "b"
3 4 test4 "4" "b"
[4 rows x 4 columns]
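The two assignments above hardcode 'a' and 'b'. As a rough sketch, the same boolean-mask selection can be looped over every label, assuming the frame really does carry duplicate column names (it is built by hand here, not read from the CSV). `join_row` is a name invented for this example, and it does not add the surrounding quotes.

```python
import numpy as np
import pandas as pd

# Built by hand so the column labels are truly duplicated.
rows = [
    [1, "test1", 1.0, np.nan, np.nan, "a", np.nan],
    [2, "test2", np.nan, 2.0, np.nan, "a", np.nan],
    [3, "test3", 2.0, 3.0, np.nan, np.nan, "b"],
    [4, "test4", np.nan, np.nan, 4.0, np.nan, "b"],
]
df = pd.DataFrame(rows, columns=["ID", "Name", "a", "a", "a", "b", "b"])


def join_row(row):
    # Hypothetical helper: join the non-null values of one row with ";".
    parts = []
    for value in row:
        if pd.isna(value):
            continue
        if isinstance(value, float) and float(value).is_integer():
            parts.append(str(int(value)))  # 2.0 -> "2"
        else:
            parts.append(str(value))
    return ";".join(parts)


out = pd.DataFrame(index=df.index)
for name in df.columns.unique():
    # The boolean mask picks every column carrying this label.
    block = df.loc[:, df.columns == name]
    if block.shape[1] == 1:
        out[name] = block.iloc[:, 0]
    else:
        out[name] = block.apply(join_row, axis=1)

print(out)
```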
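Of course, DSM and CT Zhu gave very concise answers that take advantage of many built-in features of Python, and of dataframes in particular. Here is something a bit, [cough], verbose.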
import pandas
from io import StringIO # python 3.X
#from StringIO import StringIO #python 2.X

def myJoiner(row):
    # collect the non-null values of one row as strings
    newrow = []
    for r in row:
        if not pandas.isnull(r):
            newrow.append(str(r))
    return ';'.join(newrow)

def groupCols(df, key):
    # keep the columns whose name contains `key` (e.g. a, a.1, a.2);
    # DataFrame.select was removed from pandas, filter(like=...) does the same substring match
    columns = df.filter(like=key, axis=1)
    joined = columns.apply(myJoiner, axis=1)
    joined.name = key
    return pandas.DataFrame(joined)

data = StringIO("""\
ID Name a a a b b
1 test1 1 NaN NaN "a" NaN
2 test2 NaN 2 NaN "a" NaN
3 test3 2 3 NaN NaN "b"
4 test4 NaN NaN 4 NaN "b"
""")
df = pandas.read_table(data, sep=r'\s+')
# move ID and Name into the index so the substring match on 'a' cannot pick up 'Name'
df.set_index(['ID', 'Name'], inplace=True)
AB = groupCols(df, 'a').join(groupCols(df, 'b'))
print(AB)
This gives me:
a b
ID Name
1 test1 1.0 a
2 test2 2.0 a
3 test3 2.0;3.0 b
4 test4 4.0 b