Related problems and solutions (0)

Creating a custom Transformer in PySpark ML

I'm new to Spark SQL DataFrames and ML on them (PySpark). How can I create a custom tokenizer, for example one that removes stop words and uses some external library? Can I extend the default one?

Thanks.
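A minimal sketch of one way to do this, assuming PySpark 2.x; the class name, its parameters and the stop-word handling are illustrative, not part of any Spark API. The idea is to subclass pyspark.ml.Transformer and reuse the HasInputCol/HasOutputCol param mixins:

from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

class StopwordTokenizer(Transformer, HasInputCol, HasOutputCol):
    """Hypothetical tokenizer: lower-cases, splits on whitespace, drops stop words."""

    def __init__(self, inputCol=None, outputCol=None, stopwords=None):
        super(StopwordTokenizer, self).__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)
        self.stopwords = set(stopwords or [])

    def _transform(self, dataset):
        stopwords = self.stopwords  # capture a plain set, not self, in the UDF closure
        tokenize = udf(
            lambda text: [t for t in (text or "").lower().split() if t not in stopwords],
            ArrayType(StringType()))
        return dataset.withColumn(self.getOutputCol(),
                                  tokenize(dataset[self.getInputCol()]))

# Usage sketch: df is assumed to have a string column "text".
tokenizer = StopwordTokenizer(inputCol="text", outputCol="tokens",
                              stopwords=["the", "a", "an"])
tokens = tokenizer.transform(df)

Because it subclasses Transformer, such a stage can be placed in a Pipeline next to the built-in stages; for plain tokenization plus stop-word removal, the built-in Tokenizer and StopWordsRemover from pyspark.ml.feature may already be enough.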

python nltk apache-spark pyspark apache-spark-ml

19 votes · 1 answer · 10k views

Attaching metadata to a vector column in Spark

Context: I have a DataFrame with two columns: label and features.

org.apache.spark.sql.DataFrame = [label: int, features: vector]

where features is a mllib.linalg.VectorUDT of numeric columns built with VectorAssembler.

Question: Is there a way to assign a schema to the features vector? I'd like to keep track of the name of each feature.

Tried so far:

val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("feat1", "feat2", "feat3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])
scala> attrGroup.toMetadata 
res197: org.apache.spark.sql.types.Metadata = {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"f1"},{"idx":1,"name":"f2"},{"idx":2,"name":"f3"}]},"num_attrs":3}}

But I'm not sure how to apply it to an existing DataFrame.
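In Scala the metadata can be attached by re-aliasing the column, since Column.as(alias, metadata) accepts a Metadata argument. Below is a hedged PySpark sketch of the same idea, assuming Spark 2.2+ (where Column.alias accepts a metadata keyword) and a DataFrame df shaped as above; the attribute names follow the snippet above:

from pyspark.sql.functions import col

# The attribute-group structure expressed as a plain Python dict.
meta = {"ml_attr": {
    "attrs": {"numeric": [{"idx": 0, "name": "feat1"},
                          {"idx": 1, "name": "feat2"},
                          {"idx": 2, "name": "feat3"}]},
    "num_attrs": 3}}

# Re-alias the features column with the metadata attached; the data itself is untouched.
df = df.withColumn("features", col("features").alias("features", metadata=meta))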

scala apache-spark apache-spark-ml apache-spark-mllib

10 votes · 1 answer · 3351 views

Spark random forest cross-validation error

I'm trying to run cross-validation on a random forest in Spark.

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.regression import LabeledPoint  # needed for the sample data below

data = nds.sc.parallelize([
 LabeledPoint(0.0, [0,402,6,0]),
 LabeledPoint(0.0, [3,500,3,0]),
 LabeledPoint(1.0, [1,590,1,1]),
 LabeledPoint(1.0, [3,328,5,0]),
 LabeledPoint(1.0, [4,351,4,0]),
 LabeledPoint(0.0, [2,372,2,0]),
 LabeledPoint(0.0, [4,302,5,0]),
 LabeledPoint(1.0, [1,387,2,0]),
 LabeledPoint(1.0, [1,419,3,0]),
 LabeledPoint(0.0, [1,370,5,0]),
 LabeledPoint(0.0, [1,410,4,0]),
 LabeledPoint(0.0, [2,509,7,1]),
 LabeledPoint(0.0, [1,307,5,0]),
 LabeledPoint(0.0, [0,424,4,1]),
 LabeledPoint(0.0, [1,509,2,1]),
 LabeledPoint(1.0, [3,361,4,0]),
 ])


train=data.toDF(['label','features'])

numfolds =2

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
evaluator = MulticlassClassificationEvaluator()  


paramGrid = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [4, 8, 10]) \
    .addGrid(rf.impurity, ['entropy', 'gini']) \
    .addGrid(rf.featureSubsetStrategy, [6, 8, 10]).build()

pipeline = Pipeline(stages=[rf])

crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid, …
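A hedged sketch of a variant that runs, assuming Spark 2.x and a SparkSession named spark: the usual culprits here are mixing pyspark.mllib LabeledPoint data with pyspark.ml estimators, and passing plain integers to featureSubsetStrategy, which expects string values.

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Build the training data as an ml-style DataFrame (label, Vector) instead of an
# RDD of mllib LabeledPoints.
train = spark.createDataFrame([
    (0.0, Vectors.dense([0, 402, 6, 0])),
    (0.0, Vectors.dense([3, 500, 3, 0])),
    (1.0, Vectors.dense([1, 590, 1, 1])),
    (1.0, Vectors.dense([3, 328, 5, 0])),
    (0.0, Vectors.dense([2, 372, 2, 0])),
    (1.0, Vectors.dense([1, 387, 2, 0])),
    # ... remaining rows as in the snippet above
], ["label", "features"])

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
evaluator = MulticlassClassificationEvaluator()

# featureSubsetStrategy takes strings such as "auto", "sqrt", "log2" or "0.5".
paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [4, 8, 10])
             .addGrid(rf.impurity, ["entropy", "gini"])
             .addGrid(rf.featureSubsetStrategy, ["auto", "sqrt", "log2"])
             .build())

crossval = CrossValidator(estimator=Pipeline(stages=[rf]),
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=2)

cvModel = crossval.fit(train)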

python apache-spark pyspark apache-spark-mllib

5 votes · 1 answer · 4100 views

How to change column metadata in PySpark?

How can I update column metadata in PySpark? I have metadata values corresponding to the nominal encoding of categorical (string) features, and I would like to decode them automatically. Writing the metadata directly through the PySpark API is not possible unless the schema is recreated. Is it possible to edit the metadata in PySpark on the fly, without converting the dataset to an RDD and back, which requires a full schema description (as described here)?

Example listing:

# Create DF
df.show()

# +---+-------------+
# | id|     features|
# +---+-------------+
# |  0|[1.0,1.0,4.0]|
# |  1|[2.0,2.0,4.0]|
# +---+-------------+
# - This frame carries all the necessary metadata about what is encoded in the features column

# Slice one feature out
df = VectorSlicer(inputCol='features', outputCol='categoryIndex', indices=[1]).transform(df)
df = df.drop('features')
# +---+-------------+
# | id|categoryIndex|
# +---+-------------+
# |  0|        [1.0]|
# |  1|        [2.0]|
# +---+-------------+
# categoryIndex now carries metadata about singular array with …
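A hedged sketch of editing the metadata on the fly, assuming Spark 2.2+ (where Column.alias accepts a metadata keyword) and a DataFrame df shaped like the listing above; the "vals" entry added here is purely illustrative:

from pyspark.sql.functions import col

# Read the current field metadata; PySpark exposes it as a plain dict.
field = next(f for f in df.schema.fields if f.name == "categoryIndex")
meta = dict(field.metadata)

# Edit it -- here we attach a hypothetical list of category labels.
meta.setdefault("ml_attr", {})["vals"] = ["small", "medium", "large"]

# Write it back by re-aliasing the column; no RDD round-trip or full schema
# reconstruction is needed.
df = df.withColumn("categoryIndex",
                   col("categoryIndex").alias("categoryIndex", metadata=meta))

print(next(f for f in df.schema.fields if f.name == "categoryIndex").metadata)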

metadata apache-spark pyspark apache-spark-ml

4 votes · 1 answer · 3052 views