Pickling spaCy for PySpark

Question

Chr*_*s C 5 python user-defined-functions apache-spark pyspark

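The spaCy 2.0 documentation mentions that the developers added functionality to allow spaCy to be pickled, so that it can be used by a Spark cluster accessed through PySpark. However, it gives no instructions on how to do this.

Can someone explain how I can pickle spaCy's English NE parser so that I can use it inside a udf function?

This doesn't work: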

from pyspark import cloudpickle
from spacy.lang.en import English  # import was missing from the snippet

nlp = English()
pickled_nlp = cloudpickle.dumps(nlp)


Answer 1

Chr*_*s C 5

Not really an answer, but the best workaround I've found:

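from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
import spacy

def get_entities_udf():
    def get_entities(text):
        # nlp is a global inside each Python worker process: it is loaded
        # lazily on the executor instead of being pickled on the driver.
        global nlp
        try:
            doc = nlp(text)
        except NameError:
            # First call in this worker process: load the model, then retry.
            nlp = spacy.load('en')
            doc = nlp(text)
        # Return the label of every named entity found in the text.
        return [ent.label_ for ent in doc.ents]
    # The function returns a list of strings, hence ArrayType(StringType()).
    res_udf = udf(get_entities, ArrayType(StringType()))
    return res_udf

documents_df = documents_df.withColumn('entities', get_entities_udf()('text'))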

