Add a row to a Pandas DataFrame only if it does not already exist

Question

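I am incrementally appending rows of web-scraped data to a DataFrame. Sometimes, though, the data I am scraping already exists in the DataFrame, so I don't want to append it again. What is the most efficient way to check whether the DataFrame already contains the data? Dropping duplicates at the end is not an option, because I want to extract a specific number of records, and dropping duplicates at the end would leave the final DataFrame with fewer records than specified.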

    res = pd.DataFrame([], columns=GD_SCHEMA)

    reviews = self.browser.find_elements_by_class_name('empReview')
    idx = 0
    for review in reviews:
        data = extract_review(review)  # This is a dict with the same keys as `res`

        # Most efficient way to check if `data` already exists in `res` before appending?
        res.loc[idx] = data
        idx += 1

Answer 1

And*_*asT 1

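I would suggest using an intermediate dictionary. If you choose the dictionary's keys wisely, so that duplicate records produce equal keys (and therefore equal hashes), you end up with a dictionary that contains no duplicates, which you can load into a DataFrame once it reaches the desired length.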
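
The idea above can be sketched roughly as follows. This is an illustration under assumptions, not the answerer's own code: it assumes each scraped record is a dict and that a handful of fields identify a review uniquely. The field names (`author`, `date`, `title`, `rating`), the `KEY_FIELDS` tuple, and the helpers `record_key` and `collect_unique` are all hypothetical.

```python
# Hypothetical fields that together identify a review; adjust to the real schema.
KEY_FIELDS = ("author", "date", "title")


def record_key(record, key_fields=KEY_FIELDS):
    """Build a hashable key; duplicate records must map to the same key."""
    return tuple(record[field] for field in key_fields)


def collect_unique(records, limit, key_fields=KEY_FIELDS):
    """Consume records until `limit` distinct ones have been collected."""
    if limit <= 0:
        return []
    unique = {}
    for record in records:
        # setdefault keeps the first record seen for a key and ignores repeats.
        unique.setdefault(record_key(record, key_fields), record)
        if len(unique) >= limit:
            break
    # Dicts preserve insertion order (Python 3.7+), so scrape order is kept.
    return list(unique.values())


# Made-up sample data standing in for the output of extract_review().
scraped = [
    {"author": "a", "date": "2020-01-01", "title": "Good", "rating": 4},
    {"author": "b", "date": "2020-01-02", "title": "Bad", "rating": 1},
    {"author": "a", "date": "2020-01-01", "title": "Good", "rating": 4},  # duplicate
    {"author": "c", "date": "2020-01-03", "title": "OK", "rating": 3},
    {"author": "d", "date": "2020-01-04", "title": "Fine", "rating": 3},
]

rows = collect_unique(scraped, limit=3)
```

Once enough rows have been collected, something like `pd.DataFrame(rows, columns=GD_SCHEMA)` should load them into a DataFrame in a single step. Each duplicate check is then a hash lookup on the dictionary rather than a scan of the table.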

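Is this really the best answer? As I write this comment, the best options seem to be: don't use a pandas DataFrame at all and use a simple hash map instead, or, as someone suggested, run a linear-complexity search over the table. (3 upvotes)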

Archived:	6 years, 8 months ago
Views:	18918
Last activity:	2 years, 6 months ago