Posts by Ped*_*mas

Improve performance of combinations

I have a script that compares every possible pair of users and checks how similar their text is:

    dictionary = {
        t.id: (
            t.text,
            t.set,
            t.compare_string
        )
        for t in dataframe.itertuples()
    }

    highly_similar = []

    for a, b in itertools.combinations(dictionary.items(), 2):
        if a[1][2] == b[1][2] and not a[1][1].isdisjoint(b[1][1]):
            similarity_score = fuzz.ratio(a[1][0], b[1][0])

            if (similarity_score >= 95 and len(a[1][0]) >= 10) or similarity_score == 100:
                highly_similar.append([a[0], b[0], a[1][0], b[1][0], similarity_score])


This script takes around 15 minutes to run; the dataframe contains …
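Since the loop only scores pairs whose `compare_string` values match, one common way to cut the number of pairs is to bucket the records by that key first and run `itertools.combinations` inside each bucket. The following is a minimal sketch under assumptions, not the accepted solution: `find_similar` and `ratio` are hypothetical names, the sample `records` are made up, and `ratio` uses `difflib.SequenceMatcher` from the standard library as a stand-in for the post's `fuzz.ratio`, so scores may differ from the original library's.

```python
import itertools
from collections import defaultdict
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer on a 0-100 scale (assumption: not the post's fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def find_similar(records):
    """records maps id -> (text, token_set, compare_string), as in the post."""
    # Bucket by compare_string so pairs are only generated within a bucket.
    buckets = defaultdict(list)
    for rec_id, (text, tokens, key) in records.items():
        buckets[key].append((rec_id, text, tokens))

    highly_similar = []
    for group in buckets.values():
        for (id_a, text_a, set_a), (id_b, text_b, set_b) in itertools.combinations(group, 2):
            # Cheap set check first; skip the expensive scoring when it fails.
            if set_a.isdisjoint(set_b):
                continue
            score = ratio(text_a, text_b)
            if (score >= 95 and len(text_a) >= 10) or score == 100:
                highly_similar.append([id_a, id_b, text_a, text_b, score])
    return highly_similar


# Made-up sample data for illustration only.
records = {
    1: ("hello world again", {"hello", "world", "again"}, "x"),
    2: ("hello world again", {"hello", "world", "again"}, "x"),
    3: ("hello world again", {"hello", "world", "again"}, "y"),
    4: ("something else entirely", {"something", "else", "entirely"}, "x"),
}
result = find_similar(records)
```

With many distinct `compare_string` values this generates far fewer pairs than combining all records, though the results come out grouped by bucket rather than in the original pair order. How much it helps depends on how evenly the keys are distributed.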

python

Ped*_*mas

lucky-day

5 votes

1 answer

360 views

Pyspark error when converting a boolean column to pandas

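I'm trying to use pyspark's toPandas() function on a simple dataframe with an id column (int), a score column (float) and a "pass" column (boolean).

My problem is that whenever I call the function, I get this error: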

>       raise AttributeError("module {!r} has no attribute "
                             "{!r}".format(__name__, attr))
E       AttributeError: module 'numpy' has no attribute 'bool'

/usr/local/lib/python3.8/site-packages/numpy/__init__.py:284: AttributeError


The column:

0    False
1    False
2    False
3     True
Name: pass, dtype: bool
Column<'pass'>


Do I need to manually convert this column to a different type?
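The traceback is consistent with the `np.bool` alias being absent from the installed NumPy (it was removed in NumPy 1.24) while the PySpark version in use still refers to it. That is an assumption about the environment, since the post does not state either version. A minimal sketch for checking it, where `numpy_lacks_bool_alias` is a hypothetical helper name:

```python
import importlib.util
import warnings


def numpy_lacks_bool_alias():
    """Return True if the installed NumPy has no `bool` attribute,
    False if it has one, and None if NumPy is not installed."""
    if importlib.util.find_spec("numpy") is None:
        return None
    import numpy as np

    with warnings.catch_warnings():
        # Some NumPy versions only warn when the deprecated alias is accessed.
        warnings.simplefilter("ignore")
        return not hasattr(np, "bool")


status = numpy_lacks_bool_alias()
print("np.bool missing:", status)
```

If this reports `True`, the commonly suggested remedies are upgrading PySpark to a release that no longer uses the alias or pinning NumPy below 1.24, rather than casting the column by hand.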

pyspark

Ped*_*mas

lucky-day

4 votes

1 answer

1702 views

Tag statistics

pyspark ×1

python ×1
