Posts by erw*_*nlc

Python: count substrings from another list of strings in a list, without duplicates

I have two lists:

main_list = ['Smith', 'Smith', 'Roger', 'Roger-Smith', '42']
master_list = ['Smith', 'Roger']

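I want to count how many times the strings from master_list are found inside the strings of main_list, without counting the same item twice.

Example: for the two lists above, my function should return 4. 'Smith' can be found 3 times in main_list. 'Roger' can be found 2 times, but 'Roger-Smith' was already counted as a match for 'Smith', so it is not counted again. 'Roger' therefore counts only once, for a total of 4.

The function I have written so far is below, but I think there is a faster way: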

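def string_detection(master_list, main_list):
    # Work on a copy so the caller's main_list is not modified.
    remaining = list(main_list)
    count = 0
    for substring in master_list:
        # Iterate over a snapshot, because matches are removed from remaining.
        for string in list(remaining):
            if substring in string:
                remaining.remove(string)
                count += 1
    return count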
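
Since each string is counted at most once, the same total can be obtained by asking, for every string in main_list, whether it contains any entry of master_list. The following is a minimal sketch of that idea, not a benchmarked solution; the name count_matching_strings is hypothetical. It avoids the repeated list copies and remove() calls, which is likely to help on long lists.

```python
def count_matching_strings(master_list, main_list):
    """Count the strings in main_list that contain at least one master_list entry."""
    # any() stops at the first matching substring, so each string is
    # counted at most once, and main_list is left unmodified.
    return sum(
        1 for string in main_list
        if any(substring in string for substring in master_list)
    )


main_list = ['Smith', 'Smith', 'Roger', 'Roger-Smith', '42']
master_list = ['Smith', 'Roger']
print(count_matching_strings(master_list, main_list))  # prints 4
```

The four counted items are 'Smith', 'Smith', 'Roger' and 'Roger-Smith'; '42' contains neither name.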

python string list

erw*_*nlc

2017-02-16

5 votes
1 answer
3214 views

Converting a PySpark DenseVector to an array

I am trying to convert a PySpark DataFrame column of DenseVector values into an array, but I always get an error.

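# Assumes the pyspark.ml.linalg vector type and an existing SparkSession named spark.
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([8.0, 1.0, 3.0, 2.0, 5.0]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]

df = spark.createDataFrame(data, ["features"])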

I tried defining a UDF that calls toArray():

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType
to_array = udf(lambda x: x.toArray(), ArrayType(FloatType()))
df = df.withColumn('features', to_array('features'))

However, when I run df.collect(), I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 17.0 failed 4 times, 
most recent failure: Lost task 1.3 in stage 17.0 (TID 100, 10.139.64.6, executor 0): 
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict 
(for numpy.core.multiarray._reconstruct) …
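
This PickleException is what Spark typically raises when a Python UDF returns a NumPy object: toArray() yields a numpy.ndarray, which the JVM side cannot unpickle. A minimal sketch of one workaround, assuming the column holds pyspark.ml.linalg vectors, is to convert the values to built-in Python floats before returning them. The names vector_to_list and make_to_array_udf are hypothetical, and DoubleType is used here in place of FloatType because Python floats are double precision.

```python
def vector_to_list(vector):
    """Convert a vector's values to a list of built-in Python floats."""
    # toArray() returns a numpy.ndarray for Spark vectors; float() turns each
    # NumPy scalar into a plain Python float that Spark can serialize.
    return [float(value) for value in vector.toArray()]


def make_to_array_udf():
    """Wrap vector_to_list in a Spark UDF (imports pyspark lazily)."""
    # Imported here so that vector_to_list can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, DoubleType

    return udf(vector_to_list, ArrayType(DoubleType()))


# Usage with the DataFrame from the question (requires a running SparkSession):
# to_array = make_to_array_udf()
# df = df.withColumn('features', to_array('features'))
```

On Spark 3.0 and later, pyspark.ml.functions.vector_to_array offers a built-in alternative that avoids a Python UDF altogether.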

python pyspark

erw*_*nlc

lucky-day

4 votes
1 answer
5414 views