How to split a string between numbers in PySpark or with NLP

Question


ver*_*cla 3 nlp dataframe apache-spark pyspark

I want to split the values in one column of a DataFrame on several different delimiters. Example:

s = "Cras mattis MP the -69661/69662;69663 /IS4567"


How can I get:

s = ['Cras', 'mattis', 'MP', 'the', '69661', '69662', '69663', 'IS4567' ]


Thanks

Answer 1

jxc*_*jxc 8

One approach is to use the SparkSQL built-in functions sentences() and flatten() [flatten() requires Spark 2.4.0+]:

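from pyspark.sql.functions import expr

# df is assumed to be an existing DataFrame with a string column named s.
# sentences(s) returns an array of word arrays; flatten() merges them into one array.
df.withColumn('new_s', expr('flatten(sentences(s))')).show(truncate=False)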
#+---------------------------------------------+----------------------------------------------------+
#|s                                            |new_s                                               |
#+---------------------------------------------+----------------------------------------------------+
#|Cras mattis MP the -69661/69662;69663 /IS4567|[Cras, mattis, MP, the, 69661, 69662, 69663, IS4567]|
#+---------------------------------------------+----------------------------------------------------+
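
To sanity-check the expected tokens without a Spark session, the same list can be approximated in plain Python. This is only a sketch: it uses `re.findall` rather than Spark's sentences(), and it assumes that every run of characters other than ASCII letters and digits is a separator, so it may behave differently from Spark on other inputs (for example, words containing apostrophes or non-ASCII letters).

```python
import re

s = "Cras mattis MP the -69661/69662;69663 /IS4567"

# Assumption: a token is any run of ASCII letters or digits;
# everything else (spaces, '-', '/', ';') acts as a separator.
tokens = re.findall(r"[A-Za-z0-9]+", s)
print(tokens)
```

For the sample string this yields the eight tokens requested in the question.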


What sentences() does, according to the Apache Hive documentation:

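Tokenizes a string of natural language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The 'lang' and 'locale' are optional arguments. For example, sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ).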
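
The nested result described above is why flatten() is needed in the answer. The following plain-Python sketch imitates that two-step shape. The helper names `sentences_sketch` and `flatten_sketch` are hypothetical, and the splitting rules (sentences end at '.', '!' or '?'; words are runs of ASCII letters or digits) are simplifying assumptions, not Hive's actual tokenizer.

```python
import re
from itertools import chain


def sentences_sketch(text):
    """Hypothetical stand-in for sentences(): returns a list of word lists."""
    # Assumption: a sentence ends at one or more of '.', '!' or '?'.
    parts = re.split(r"[.!?]+", text)
    # Assumption: a word is a run of ASCII letters or digits.
    nested = [re.findall(r"[A-Za-z0-9]+", part) for part in parts]
    # Drop empty sentences (e.g. the empty piece after the final '?').
    return [words for words in nested if words]


def flatten_sketch(nested):
    """Hypothetical stand-in for flatten(): merges the inner lists into one."""
    return list(chain.from_iterable(nested))


nested = sentences_sketch("Hello there! How are you?")
flat = flatten_sketch(nested)
print(nested)  # one inner list per sentence
print(flat)    # a single flat list of words
```

The first print shows one word list per sentence, matching the structure in the documentation example; the second shows the single array that the answer stores in new_s.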
