Hey guys, I have a script that compares every possible pair of users and checks how similar their text is:
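    import itertools

    # `fuzz` is assumed to come from a fuzzy-matching library such as
    # thefuzz or rapidfuzz; the post does not show its import.

    # Map each id to (text, set, compare_string).
    dictionary = {
        t.id: (
            t.text,
            t.set,
            t.compare_string
        )
        for t in dataframe.itertuples()
    }

    highly_similar = []
    for a, b in itertools.combinations(dictionary.items(), 2):
        # Only score pairs with the same compare_string and overlapping sets.
        if a[1][2] == b[1][2] and not a[1][1].isdisjoint(b[1][1]):
            similarity_score = fuzz.ratio(a[1][0], b[1][0])
            if (similarity_score >= 95 and len(a[1][0]) >= 10) or similarity_score == 100:
                highly_similar.append([a[0], b[0], a[1][0], b[1][0], similarity_score])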
This script takes around 15 minutes to run. The dataframe contains …
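Since pairs are only scored when their compare_string values match, one possible speed-up is to group rows by compare_string first and build combinations inside each group only. Below is a minimal sketch of that idea, not the original script: the function name `find_highly_similar` and the `(id, text, set, compare_string)` tuple layout are assumptions, and `difflib.SequenceMatcher` from the standard library stands in for `fuzz.ratio` so the example runs on its own (its scores may differ slightly from the fuzzy-matching library's).

```python
import itertools
from collections import defaultdict
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in for fuzz.ratio (assumption): 0-100 similarity score.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def find_highly_similar(rows):
    """rows: iterable of (id, text, token_set, compare_string) tuples."""
    # Group rows by compare_string so that rows which can never match
    # are never paired in the first place.
    buckets = defaultdict(list)
    for row_id, text, tokens, key in rows:
        buckets[key].append((row_id, text, tokens))

    highly_similar = []
    for group in buckets.values():
        for a, b in itertools.combinations(group, 2):
            id_a, text_a, set_a = a
            id_b, text_b, set_b = b
            # Same overlap check as the original script.
            if set_a.isdisjoint(set_b):
                continue
            score = ratio(text_a, text_b)
            if (score >= 95 and len(text_a) >= 10) or score == 100:
                highly_similar.append([id_a, id_b, text_a, text_b, score])
    return highly_similar
```

How much this helps depends on the data: if most rows share the same compare_string, the number of pairs barely changes, while many small groups cut it down a lot.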
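I'm trying to use pyspark's toPandas() function on a simple dataframe with an id column (int), a score column (float), and a "pass" column (boolean).
My problem is that whenever I call the function, I get this error: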
    >       raise AttributeError("module {!r} has no attribute "
                                 "{!r}".format(__name__, attr))
    E       AttributeError: module 'numpy' has no attribute 'bool'

    /usr/local/lib/python3.8/site-packages/numpy/__init__.py:284: AttributeError
The column:
    0    False
    1    False
    2    False
    3     True
    Name: pass, dtype: bool
    Column<'pass'>
Do I need to manually convert this column to another type?
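One thing worth checking before converting the column is the installed NumPy version. The `np.bool` alias was deprecated in NumPy 1.20 and removed in 1.24 (NumPy 2.0 brought the name back), so code that still refers to `np.bool` raises exactly this AttributeError on the 1.24 to 1.x releases. A small sketch of that check follows; the helper names are made up for illustration, and whether the installed pyspark release is what refers to `np.bool` is an assumption to verify against the traceback.

```python
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def numpy_lacks_bool_alias(numpy_version):
    # np.bool is missing from NumPy 1.24 up to (but not including) 2.0.
    return (1, 24) <= parse_version(numpy_version) < (2, 0)


def installed_version(package):
    # Returns None when the package is not installed in this environment.
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("numpy")
if version is None:
    print("numpy is not installed in this environment")
elif numpy_lacks_bool_alias(version):
    print(f"numpy {version} has no np.bool alias")
else:
    print(f"numpy {version} still provides np.bool")
```

If the check reports that the alias is missing, the usual options are pinning NumPy below 1.24 or moving to a pyspark release that no longer uses `np.bool`, rather than casting the "pass" column by hand.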