Is it possible in PySpark to pivot on several different columns at once? I have a dataframe like this:
from pyspark.sql import functions as sf
import pandas as pd
sdf = spark.createDataFrame(
    pd.DataFrame(
        [[1, 'str1', 'str4'], [1, 'str1', 'str4'], [1, 'str2', 'str4'], [1, 'str2', 'str5'],
         [1, 'str3', 'str5'], [2, 'str2', 'str4'], [2, 'str2', 'str4'], [2, 'str3', 'str4'],
         [2, 'str3', 'str5']],
        columns=['id', 'col1', 'col2'])
)
# +----+------+------+
# | id | col1 | col2 |
# +----+------+------+
# | 1 | str1 | str4 |
# | 1 | str1 | str4 |
# | 1 | str2 …

I am trying to retrieve the feature importances of a multi-output classifier built on RandomForest.
The MultiOutput model itself runs without any problems:
import numpy as np
import pandas as pd
import sklearn
from sklearn.datasets import make_multilabel_classification
from sklearn.datasets import make_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
## Generate data
x, y = make_multilabel_classification(n_samples=1000,
                                      n_features=15,
                                      n_labels=5,
                                      n_classes=3,
                                      random_state=12,
                                      allow_unlabeled=True)
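# First 700 rows for training, the remaining 300 for testing
# (slicing from 701 would silently drop row 700)
x_train = x[:700, :]
x_test = x[700:, :]
y_train = y[:700, :]
y_test = y[700:, :]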
## Generate model
forest = RandomForestClassifier(n_estimators = 100, random_state = 1)
multi_forest = MultiOutputClassifier(forest, n_jobs = -1).fit(x_train, y_train)
## Make prediction …
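The rest of the question is cut off here, so what follows is only a minimal sketch of one common way to get at the importances, not necessarily what the asker ended up doing. It assumes the fitted `MultiOutputClassifier` keeps one fitted forest per output column in its `estimators_` attribute, each with its own `feature_importances_`. The names `importances` and `mean_importances` are made up for illustration, and `n_jobs` is left at its default to keep the example simple.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Same synthetic data as in the question: 15 features, 3 output labels
x, y = make_multilabel_classification(n_samples=1000,
                                      n_features=15,
                                      n_labels=5,
                                      n_classes=3,
                                      random_state=12,
                                      allow_unlabeled=True)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
multi_forest = MultiOutputClassifier(forest).fit(x[:700, :], y[:700, :])

# The wrapper stores one fitted forest per output column in estimators_;
# collect each forest's importances into an (n_outputs, n_features) array.
importances = np.array([est.feature_importances_
                        for est in multi_forest.estimators_])

# Averaging across outputs is one possible summary (an assumption of this
# sketch); per-output rows may be more informative.
mean_importances = importances.mean(axis=0)

for i, row in enumerate(importances):
    print(f"output {i}: top feature index = {row.argmax()}")
```

Whether to report the per-output rows or the average depends on the use case: the average hides features that matter for only one of the labels.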