Posts by Pau*_*ulH

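How to pivot on multiple columns separately in PySpark

Is it possible in PySpark to build pivot tables for several different columns at the same time? I have a dataframe like this: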

from pyspark.sql import functions as sf
import pandas as pd
sdf = spark.createDataFrame(
    pd.DataFrame([[1, 'str1', 'str4'], [1, 'str1', 'str4'], [1, 'str2', 'str4'], [1, 'str2', 'str5'],
        [1, 'str3', 'str5'], [2, 'str2', 'str4'], [2, 'str2', 'str4'], [2, 'str3', 'str4'],
        [2, 'str3', 'str5']], columns=['id', 'col1', 'col2'])
)
# +----+------+------+
# | id | col1 | col2 |
# +----+------+------+
# |  1 | str1 | str4 |
# |  1 | str1 | str4 |
# |  1 | str2 …

python pivot multiple-columns apache-spark pyspark

5
Votes
1
Answers
6988
Views

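Can I display feature importances for a MultiOutputClassifier?

I am trying to retrieve the feature importances of a multi-output classifier built on RandomForest.

The MultiOutput model itself fits without any problems: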

import numpy as np
import pandas as pd
import sklearn
from sklearn.datasets import make_multilabel_classification
from sklearn.datasets import make_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

## Generate data
x, y = make_multilabel_classification(n_samples=1000, 
                                      n_features=15, 
                                      n_labels = 5, 
                                      n_classes=3, 
                                      random_state=12, 
                                      allow_unlabeled = True)
x_train = x[:700,:]
x_test  = x[700:,:]  # start at 700 so that sample 700 is not dropped
y_train = y[:700,:]
y_test  = y[700:,:]

## Generate model
forest = RandomForestClassifier(n_estimators = 100, random_state = 1)
multi_forest = MultiOutputClassifier(forest, n_jobs = -1).fit(x_train, y_train)

## Make prediction …
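A minimal sketch of one way to get at the importances, assuming the same setup as in the question: `MultiOutputClassifier` fits one clone of the base estimator per output column and stores them in `estimators_`, so each fitted forest exposes its own `feature_importances_`. The variable names `importances` and `mean_importance` are illustrative, and averaging across outputs is just one possible summary, not something the original post asks for.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Same synthetic data as in the question.
x, y = make_multilabel_classification(n_samples=1000, n_features=15, n_labels=5,
                                      n_classes=3, random_state=12,
                                      allow_unlabeled=True)
x_train, y_train = x[:700], y[:700]

forest = RandomForestClassifier(n_estimators=100, random_state=1)
multi_forest = MultiOutputClassifier(forest).fit(x_train, y_train)

# One fitted RandomForestClassifier per output column lives in `estimators_`.
importances = np.vstack([est.feature_importances_ for est in multi_forest.estimators_])
print(importances.shape)  # rows are outputs, columns are features

# Optional summary: average importance of each feature across the outputs.
mean_importance = importances.mean(axis=0)
for idx in np.argsort(mean_importance)[::-1][:5]:
    print(f"feature {idx}: {mean_importance[idx]:.3f}")
```

Keeping the per-output rows is usually more informative than the average, since a feature can matter for one label and be irrelevant for another.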

python random-forest scikit-learn

4
Votes
1
Answers
1351
Views