two*_*pad 3 apache-spark apache-spark-sql pyspark apache-spark-ml apache-spark-mllib
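Is there a way in PySpark to perform feature selection while preserving, or recovering, a mapping back to the original feature indices/descriptions?
For example:
CountVectorizer (col = "features") converts the raw features into numeric counts.
ChiSqSelector then selects the top 1000 features (col = "selectedFeatures").
How do I get the raw feature strings that correspond to those top 1000 features (or even just the indices of the selected features in the original "features" column)?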
This information can be traced back through the fitted Transformers. With a Pipeline like this:
from pyspark.ml.feature import CountVectorizer, ChiSqSelector
from pyspark.ml import Pipeline
import numpy as np

# Assumes an active SparkSession is available as `spark`.
data = spark.createDataFrame(
    [(1, ["spark", "foo", "bar"]), (0, ["kafka", "bar", "foo"])],
    ("label", "rawFeatures"))

# ChiSqSelector reads the default "features" and "label" columns.
model = Pipeline(stages=[
    CountVectorizer(inputCol="rawFeatures", outputCol="features"),
    ChiSqSelector(outputCol="selectedFeatures", numTopFeatures=2)
]).fit(data)
you can extract the fitted Transformers:
vectorizer, chisq = model.stages
and look up selectedFeatures in the vocabulary:
np.array(vectorizer.vocabulary)[chisq.selectedFeatures]
array(['spark', 'kafka'], dtype='<U5')
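The lookup above can also be written as a small helper that records, for each slot of the selected vector, the original index and term. This is only a sketch: `selected_terms` is a hypothetical name, the sample values are stand-ins for `vectorizer.vocabulary` and `chisq.selectedFeatures`, and it assumes the selected vector keeps features in ascending order of their original index.

```python
def selected_terms(vocabulary, selected_indices):
    """Map each slot of the selected vector to (original index, term).

    Assumption: slots in the selected vector follow ascending original index.
    """
    return {
        new_idx: (orig_idx, vocabulary[orig_idx])
        for new_idx, orig_idx in enumerate(sorted(selected_indices))
    }


# Stand-in values for illustration only; with the fitted pipeline you would
# pass vectorizer.vocabulary and chisq.selectedFeatures instead.
vocabulary = ["foo", "bar", "spark", "kafka"]
selected_indices = [2, 3]

mapping = selected_terms(vocabulary, selected_indices)
print(mapping)  # {0: (2, 'spark'), 1: (3, 'kafka')}
```

Keeping the original index next to the term makes it possible to go back to the "features" column as well as to the raw strings.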
Unfortunately, this combination of Transformers doesn't preserve the label metadata:
features_meta, selected_features_meta = (
    f.metadata
    for f in model.transform(data)
    .select("features", "selectedFeatures")
    .schema.fields
)
features_meta
{}
selected_features_meta
{'ml_attr': {'attrs': {'nominal': [{'idx': 0}, {'idx': 1}]}, 'num_attrs': 2}}
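Since the names are missing from the metadata, one possible workaround is to rebuild a metadata dictionary of the same shape with a "name" entry per attribute. The sketch below is not part of the original answer: `named_attr_metadata` is a hypothetical helper, the inputs are stand-ins for `vectorizer.vocabulary` and `chisq.selectedFeatures`, and it again assumes the selected vector is ordered by ascending original index.

```python
def named_attr_metadata(vocabulary, selected_indices, attr_type="nominal"):
    """Build an ml_attr-style metadata dict that carries feature names.

    Mirrors the structure of the metadata printed above and adds a "name"
    key to every attribute. Assumes slots follow ascending original index.
    """
    terms = [vocabulary[i] for i in sorted(selected_indices)]
    attrs = [{"idx": idx, "name": term} for idx, term in enumerate(terms)]
    return {"ml_attr": {"attrs": {attr_type: attrs}, "num_attrs": len(attrs)}}


# Stand-in values for illustration only.
meta = named_attr_metadata(["foo", "bar", "spark", "kafka"], [2, 3])
print(meta)
```

A dictionary like this could then be attached to the column, for example through the `metadata` keyword of `Column.alias` in recent PySpark versions; whether downstream stages make use of it is not something the original answer covers.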