I am new to Spark SQL DataFrames and ML (PySpark). How can I create a custom tokenizer, for example one that removes stop words and uses some libraries from nltk? Can I extend the default one?
Thanks.
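A rough sketch of one way to do this, assuming pyspark >= 2.1 (for the keyword_only / _input_kwargs pattern) and an available nltk stopwords corpus; the class name and column names below are illustrative, not part of any Spark API:

from nltk.corpus import stopwords
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

class NLTKStopWordRemover(Transformer, HasInputCol, HasOutputCol):
    """Custom Transformer that drops nltk English stop words from a column of token arrays."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(NLTKStopWordRemover, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _transform(self, dataset):
        stops = set(stopwords.words("english"))
        remove_stops = udf(lambda tokens: [t for t in tokens if t.lower() not in stops],
                           ArrayType(StringType()))
        return dataset.withColumn(self.getOutputCol(), remove_stops(self.getInputCol()))

Such a class can be chained after the built-in Tokenizer or RegexTokenizer inside a Pipeline; if plain stop-word removal is all that is needed, the built-in StopWordsRemover already covers that without nltk.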
Context: I have a DataFrame with two columns: label and features.
org.apache.spark.sql.DataFrame = [label: int, features: vector]
where features is a mllib.linalg.VectorUDT of numeric type built using VectorAssembler.
Question: Is there a way to assign a schema to the feature vector? I would like to keep track of the name of each feature.
Tried so far:
val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])
scala> attrGroup.toMetadata 
res197: org.apache.spark.sql.types.Metadata = {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"f1"},{"idx":1,"name":"f2"},{"idx":2,"name":"f3"}]},"num_attrs":3}}
But I am not sure how to apply it to an existing DataFrame.
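For the PySpark side of this (the snippet above is Scala), one way to attach such metadata to an existing column is through Column.alias, assuming Spark >= 2.2, where alias accepts an optional metadata keyword argument; the DataFrame and attribute names below are illustrative:

from pyspark.sql.functions import col

# The same attribute-group metadata expressed as a plain Python dict.
meta = {"ml_attr": {"attrs": {"numeric": [{"idx": 0, "name": "f1"},
                                          {"idx": 1, "name": "f2"},
                                          {"idx": 2, "name": "f3"}]},
                    "num_attrs": 3}}

# Re-alias the vector column under its own name, carrying the metadata with it.
df = df.withColumn("features", col("features").alias("features", metadata=meta))
df.schema["features"].metadata  # -> the ml_attr block above

In Scala the equivalent is to re-select the column with the metadata attached, e.g. df.withColumn("features", $"features".as("features", attrGroup.toMetadata)).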
I am trying to run cross-validation on a random forest in Spark.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.regression import LabeledPoint  # needed for the sample data below
data = nds.sc.parallelize([
 LabeledPoint(0.0, [0,402,6,0]),
 LabeledPoint(0.0, [3,500,3,0]),
 LabeledPoint(1.0, [1,590,1,1]),
 LabeledPoint(1.0, [3,328,5,0]),
 LabeledPoint(1.0, [4,351,4,0]),
 LabeledPoint(0.0, [2,372,2,0]),
 LabeledPoint(0.0, [4,302,5,0]),
 LabeledPoint(1.0, [1,387,2,0]),
 LabeledPoint(1.0, [1,419,3,0]),
 LabeledPoint(0.0, [1,370,5,0]),
 LabeledPoint(0.0, [1,410,4,0]),
 LabeledPoint(0.0, [2,509,7,1]),
 LabeledPoint(0.0, [1,307,5,0]),
 LabeledPoint(0.0, [0,424,4,1]),
 LabeledPoint(0.0, [1,509,2,1]),
 LabeledPoint(1.0, [3,361,4,0]),
 ])
train = data.toDF(['label', 'features'])
numfolds = 2
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
evaluator = MulticlassClassificationEvaluator()
paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [4, 8, 10])
             .addGrid(rf.impurity, ['entropy', 'gini'])
             # featureSubsetStrategy is a string parameter, so pass the counts as strings
             .addGrid(rf.featureSubsetStrategy, ['6', '8', '10'])
             .build())
pipeline = Pipeline(stages=[rf])
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid, …
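For reference, since the call above is cut off, this is roughly how such a CrossValidator is usually completed and fit (a hedged sketch reusing evaluator and numfolds from above, not the asker's missing code; note that on Spark 2.x the ml estimators expect pyspark.ml.linalg vectors, so a DataFrame built from mllib LabeledPoints may first need pyspark.mllib.util.MLUtils.convertVectorColumnsToML):

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=numfolds)

cvModel = crossval.fit(train)              # grid search with k-fold cross-validation
predictions = cvModel.transform(train)     # scores with the best model found
print(evaluator.evaluate(predictions))     # f1 by default for this evaluator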
How do I update column metadata in PySpark? I have metadata values corresponding to the nominal encoding of categorical (string) features, and I would like to decode them in an automated way. Metadata cannot be written directly in the PySpark API unless the schema is recreated. Is it possible to edit the metadata in PySpark on the fly, without converting the dataset to an RDD and back while providing a complete schema description (as described here)?
Example listing:
# Create DF
df.show()
# +---+-------------+
# | id|     features|
# +---+-------------+
# |  0|[1.0,1.0,4.0]|
# |  1|[2.0,2.0,4.0]|
# +---+-------------+
# - This DataFrame carries all the necessary metadata about what is encoded in the features column
# Slice one feature out
df = VectorSlicer(inputCol='features', outputCol='categoryIndex', indices=[1]).transform(df)
df = df.drop('features')
# +---+-------------+
# | id|categoryIndex|
# +---+-------------+
# |  0|        [1.0]|
# |  1|        [2.0]|
# +---+-------------+
# categoryIndex now carries metadata about singular array with …
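One way to edit the metadata in place, without an RDD round trip, is again Column.alias with its metadata keyword (assuming Spark >= 2.2); the decoded category values below are made up for illustration:

from pyspark.sql.functions import col

# Read whatever metadata the column currently carries; it is exposed as a plain dict.
old_meta = df.schema["categoryIndex"].metadata

# Edit it in ordinary Python, e.g. attach hypothetical decoded labels behind the nominal codes.
new_meta = dict(old_meta)
new_meta["ml_attr"] = {"type": "nominal",
                       "name": "categoryIndex",
                       "vals": ["apple", "banana", "cherry"]}  # hypothetical decoded labels

# Re-attach the edited metadata by aliasing the column under its own name;
# no RDD conversion or full schema rewrite is needed.
df = df.withColumn("categoryIndex",
                   col("categoryIndex").alias("categoryIndex", metadata=new_meta))
print(df.schema["categoryIndex"].metadata)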