Preventing row duplication when performing a merge

Question

kab*_*ame 5 python csv python-2.7 python-3.x pandas

I've hit a wall with an ongoing data analysis project.

Essentially, if I have sample CSV 'A':

id   | item_num
A123 |     1
A123 |     2
B456 |     1

And I have sample CSV 'B':

id   | description
A123 | Mary had a...
A123 | ...little lamb.
B456 | ...Its fleece...

If I perform a merge using Pandas, it ends up like this:

id   | item_num | description
A123 |     1    | Mary had a...
A123 |     2    | Mary had a...
A123 |     1    | ...little lamb.
A123 |     2    | ...little lamb.
B456 |     1    | Its fleece...

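The row multiplication shown above can be reproduced with a minimal sketch. The in-memory frames `A` and `B` are hypothetical stand-ins for the two CSV files, and the `validate` check is an illustration rather than something from the original post: it makes pandas raise an error when the join key is not unique, instead of silently multiplying rows.

```python
import pandas as pd

# Hypothetical in-memory stand-ins for A.csv and B.csv from the question.
A = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
B = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# "id" is the only shared column, so it is the join key.
# A123 appears twice on each side, giving 2 x 2 = 4 rows, plus 1 row for B456.
merged = pd.merge(A, B, on="id")
print(merged.shape)

# validate= asks pandas to check key uniqueness before merging.
try:
    pd.merge(A, B, on="id", validate="one_to_one")
    many_to_many_detected = False
except pd.errors.MergeError:
    many_to_many_detected = True
print(many_to_many_detected)
```

Because every A123 row in one frame matches every A123 row in the other, `drop_duplicates()` cannot help here: the five merged rows are all distinct.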

How do I get it to become:

id   | item_num | description
A123 |     1    | Mary had a...
A123 |     2    | ...little lamb...
B456 |     1    | Its fleece...

Here is my code:

import pandas as pd

# Import CSVs
first = pd.read_csv("../PATH_TO_CSV/A.csv")
print("Imported first CSV: " + str(first.shape))
second = pd.read_csv("../PATH_TO_CSV/B.csv")
print("Imported second CSV: " + str(second.shape))


# Merge the two DataFrames on the columns they share (here, id).
result = pd.merge(first, second)
print("Merged CSVs... resulting DataFrame is: " + str(result.shape))

# Let's do a "dedupe" to deal with an issue with how Pandas handles datetime merges.
# I read about an issue where, if datetime is involved, duplicate entries will be created.
result = result.drop_duplicates()
print("Deduping... resulting DataFrame is: " + str(result.shape))

# Save to another CSV
result.to_csv("EXPORT.csv", index=False)
print("Saved to file.")

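I'd really appreciate any help; I'm very stuck! I'm working with 20,000+ rows.

Thanks.

EDIT: My post was flagged as a possible duplicate. It isn't, because I'm not necessarily trying to add a column. I'm just trying to prevent the description from being multiplied by the number of item_num values attributed to a particular id.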

Update 6/21:

How would I merge if the two DFs looked like this?

id   | item_num | other_col
A123 |     1    | lorem ipsum
A123 |     2    | dolor sit
A123 |     3    | amet, consectetur
B456 |     1    | lorem ipsum

And I have sample CSV 'B':

id   | item_num | description
A123 |     1    | Mary had a...
A123 |     2    | ...little lamb.
B456 |     1    | ...Its fleece...

So that I end up with:

id   | item_num |  other_col  | description
A123 |     1    | lorem ipsum | Mary had a...
A123 |     2    | dolor sit   | ...little lamb.
B456 |     1    | lorem ipsum | ...Its fleece...

Meaning, the row with item_num 3 and "amet, consectetur" in "other_col" would be ignored.

Answer 1

Max*_*axU 1

I would do it this way:

In [135]: result = A.merge(B.assign(item_num=B.groupby('id').cumcount()+1))

In [136]: result
Out[136]:
     id  item_num       description
0  A123         1     Mary had a...
1  A123         2   ...little lamb.
2  B456         1  ...Its fleece...

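The one-liner above can be run end to end with the following sketch. The frames `A` and `B` are rebuilt in memory from the sample tables in the question (an assumption, since the answer does not show how they were created), and the join columns are passed explicitly through `on=` for readability.

```python
import pandas as pd

# Sample data from the question, built in memory instead of read from CSV.
A = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
B = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# cumcount() numbers the rows inside each id group starting at 0,
# so adding 1 yields 1, 2, ... in the order the rows appear in B.
B_numbered = B.assign(item_num=B.groupby("id").cumcount() + 1)

# Both frames now share "id" and "item_num", so each pair matches at most once.
result = A.merge(B_numbered, on=["id", "item_num"])
print(result)
```

This pairing relies on the row order of B within each id lining up with the item_num order in A. If B is not sorted that way, the descriptions would be attached to the wrong items.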

Explanation: we can create a "virtual" item_num column in the B DF to use for joining:

In [137]: B.assign(item_num=B.groupby('id').cumcount()+1)
Out[137]:
     id       description  item_num
0  A123     Mary had a...         1
1  A123   ...little lamb.         2
2  B456  ...Its fleece...         1

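For the layout in the question's 6/21 update, B already carries a real item_num column, so no virtual column should be needed. The sketch below is one possible approach under that assumption, not part of the original answer: an inner merge on both columns keeps only the (id, item_num) pairs present in both frames, which drops the A123 row with item_num 3.

```python
import pandas as pd

# Sample data from the question's update, built in memory.
A = pd.DataFrame({
    "id": ["A123", "A123", "A123", "B456"],
    "item_num": [1, 2, 3, 1],
    "other_col": ["lorem ipsum", "dolor sit", "amet, consectetur", "lorem ipsum"],
})
B = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "item_num": [1, 2, 1],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Inner join on both keys: (A123, 3) has no partner in B and is dropped.
result = A.merge(B, on=["id", "item_num"], how="inner")
print(result)

# A left join would instead keep (A123, 3) with a missing description.
kept_all = A.merge(B, on=["id", "item_num"], how="left")
print(kept_all)
```

Choosing between `how="inner"` and `how="left"` depends on whether unmatched rows from A should be discarded or kept with an empty description.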
