Eva*_*mir 6 python nlp apache-spark pyspark apache-spark-ml
I'm wondering whether Spark has built-in functionality for combining 1-gram, 2-gram, ..., n-gram features into a single vocabulary. Setting n=2 in NGram and then feeding the result to CountVectorizer yields a dictionary containing only 2-grams. What I actually want is to combine all frequent 1-grams, 2-grams, etc. into one dictionary for my corpus.
zer*_*323 12
You can train separate NGram and CountVectorizer stages and merge the resulting vectors with VectorAssembler.
from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler
from pyspark.ml import Pipeline

def build_ngrams(inputCol="tokens", n=3):
    # One NGram stage per order 1..n
    ngrams = [
        NGram(n=i, inputCol=inputCol, outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]
    # One CountVectorizer per n-gram column
    vectorizers = [
        CountVectorizer(inputCol="{0}_grams".format(i),
                        outputCol="{0}_counts".format(i))
        for i in range(1, n + 1)
    ]
    # Concatenate all count vectors into a single features column
    assembler = [VectorAssembler(
        inputCols=["{0}_counts".format(i) for i in range(1, n + 1)],
        outputCol="features"
    )]
    return Pipeline(stages=ngrams + vectorizers + assembler)
Example usage:
df = spark.createDataFrame([
(1, ["a", "b", "c", "d"]),
(2, ["d", "e", "d"])
], ("id", "tokens"))
build_ngrams().fit(df).transform(df)
Viewed: 2630 times