Posts by Ped*_*mas

Improve performance of combinations

Hey guys, I have a script that compares every possible pair of users and checks how similar their text is:

    dictionary = {
        t.id: (
            t.text,
            t.set,
            t.compare_string
        )
        for t in dataframe.itertuples()
    }

    highly_similar = []

    for a, b in itertools.combinations(dictionary.items(), 2):
        if a[1][2] == b[1][2] and not a[1][1].isdisjoint(b[1][1]):
            similarity_score = fuzz.ratio(a[1][0], b[1][0])

            if (similarity_score >= 95 and len(a[1][0]) >= 10) or similarity_score == 100:
                highly_similar.append([a[0], b[0], a[1][0], b[1][0], similarity_score])

This script takes around 15 minutes to run; the dataframe contains …
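
One way to reduce the number of pairs, sketched here under assumptions rather than taken from the post: since a pair only counts when both `compare_string` values are equal, the records can be bucketed by `compare_string` first, and combinations generated only inside each bucket. The names `find_highly_similar` and `stand_in_ratio` are hypothetical, and `stand_in_ratio` is a `difflib`-based placeholder so the sketch runs without third-party packages; in practice the post's `fuzz.ratio` would be passed as `scorer`.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def stand_in_ratio(a, b):
    """Placeholder scorer on a 0-100 scale (assumption: replace with fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def find_highly_similar(records, scorer=stand_in_ratio):
    """records: iterable of (id, text, token_set, compare_string) tuples."""
    # Bucket by compare_string so pairs that can never satisfy the
    # equality check are not generated at all.
    buckets = defaultdict(list)
    for rec_id, text, token_set, compare_string in records:
        buckets[compare_string].append((rec_id, text, token_set))

    highly_similar = []
    for group in buckets.values():
        for (id_a, text_a, set_a), (id_b, text_b, set_b) in combinations(group, 2):
            # Cheap set check first; the string comparison is the costly step.
            if set_a.isdisjoint(set_b):
                continue
            score = scorer(text_a, text_b)
            if (score >= 95 and len(text_a) >= 10) or score == 100:
                highly_similar.append([id_a, id_b, text_a, text_b, score])
    return highly_similar
```

With the dataframe from the question, the input could be built as `((t.id, t.text, t.set, t.compare_string) for t in dataframe.itertuples())`. How much this helps depends on how many distinct `compare_string` values there are: with a single bucket it does the same work as the original loop. The results come out grouped by bucket, so their order may differ from the original.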

python

5 votes
1 answer
360 views

PySpark error when converting a boolean column to pandas

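I'm trying to use PySpark's toPandas() function on a simple DataFrame with an id column (int), a score column (float), and a "pass" column (boolean).

My problem is that whenever I call the function, I get this error: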

>       raise AttributeError("module {!r} has no attribute "
                             "{!r}".format(__name__, attr))
E       AttributeError: module 'numpy' has no attribute 'bool'

/usr/local/lib/python3.8/site-packages/numpy/__init__.py:284: AttributeError

The column:

0    False
1    False
2    False
3     True
Name: pass, dtype: bool
Column<'pass'>

Do I need to manually convert this column to a different type?
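
For context, the traceback points at NumPy itself rather than at the column's data. NumPy deprecated the `np.bool` alias in 1.20 and removed it in 1.24, and the likely cause (an assumption, since the post does not state the installed versions) is an older PySpark release that still refers to `np.bool` when toPandas() handles a boolean column. The usual fixes are upgrading PySpark or pinning NumPy below 1.24. Below is a minimal sketch of a stopgap that puts the alias back before the call; `restore_bool_alias` is a hypothetical helper name, not part of NumPy or PySpark.

```python
import types


def restore_bool_alias(module):
    """Add `module.bool` if it is missing; return True when a patch was applied.

    Intended to be called with the imported numpy module. On NumPy versions
    that still have (or have reintroduced) `bool`, this does nothing.
    """
    if hasattr(module, "bool"):
        return False
    # The removed alias was simply Python's built-in bool.
    module.bool = bool
    return True


# Demonstration on a stand-in module, so the sketch runs without NumPy installed.
fake_numpy = types.ModuleType("fake_numpy")
patched = restore_bool_alias(fake_numpy)
```

In a real session this would be `import numpy as np; restore_bool_alias(np)` before calling toPandas(). Monkeypatching a library this way is fragile, so it is better treated as a temporary workaround until the versions can be aligned.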

pyspark

4 votes
1 answer
1702 views

Tag statistics

pyspark ×1

python ×1