Shu*_*Fan 2 python apache-spark pyspark spark-dataframe apache-spark-mllib
from pyspark.ml.feature import Tokenizer

# Assumes an active SparkSession named `spark`
sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentenceDataFrame)
If I run the command
tokenized.head()
I would like to get a result like this
Row(id=0, sentence='Hi I heard about Spark',
    words=['H', 'i', ' ', 'h', 'e', 'a', ...])
However, the result I currently get is
Row(id=0, sentence='Hi I heard about Spark',
words=['Hi','I','heard','about','spark'])
Is there a way to achieve this with Tokenizer or RegexTokenizer in PySpark?
Is this a similar question? Create a custom Transformer in PySpark ML
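Take a look at the pyspark.ml documentation. Tokenizer only splits on whitespace, but RegexTokenizer, as the name suggests, uses a regular expression to find either the split points or the tokens to extract (this is configured with the gaps parameter).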
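If you pass an empty pattern and leave gaps=True (the default), you should get the result you want. RegexTokenizer lowercases its input by default, so also pass toLowercase=False if the original case must be preserved: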
from pyspark.ml.feature import RegexTokenizer
tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="")
tokenized = tokenizer.transform(sentenceDataFrame)
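To make the role of gaps more concrete, here is a minimal pure-Python sketch that approximates the documented behaviour of RegexTokenizer's parameters without needing Spark. The function `regex_tokenize` and its argument names are hypothetical and only for illustration; it uses Python's `re` module (Python 3.7+), whose regex dialect differs from the Java one Spark uses, so treat it as a rough model rather than the actual implementation.

```python
import re


def regex_tokenize(text, pattern, gaps=True, to_lowercase=True, min_token_length=1):
    """Hypothetical stand-in that mimics RegexTokenizer's documented parameters."""
    if to_lowercase:
        text = text.lower()
    if gaps:
        # gaps=True: the pattern describes the separators between tokens.
        # An empty pattern matches between every character (Python 3.7+),
        # so the split yields one token per character plus empty edge strings.
        tokens = re.split(pattern, text)
    else:
        # gaps=False: the pattern describes the tokens themselves.
        tokens = [m.group(0) for m in re.finditer(pattern, text)]
    # Tokens shorter than the minimum length (including empty strings) are dropped.
    return [t for t in tokens if len(t) >= min_token_length]


print(regex_tokenize("Hi I", ""))                       # ['h', 'i', ' ', 'i']
print(regex_tokenize("Hi I", "", to_lowercase=False))   # ['H', 'i', ' ', 'I']
print(regex_tokenize("Hi I heard", r"\s+"))             # ['hi', 'i', 'heard']
print(regex_tokenize("Hi I", r"\w+", gaps=False))       # ['hi', 'i']
```

In this sketch the empty pattern with gaps=True and the pattern "." with gaps=False both give one token per character, which suggests two equivalent ways of configuring the real tokenizer; which one reads better is a matter of taste.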