I have a Python class that I use to load and process some data in Spark. Among the various things I need to do, I'm generating a list of dummy variables derived from various columns in a Spark dataframe. My problem is that I'm not sure how to properly define a user-defined function to accomplish what I need.
What I currently have, when mapped over the underlying dataframe RDD, solves half the problem (keep in mind this is a method in a larger data_processor class):
def build_feature_arr(self, table):
    # this dict has keys for all the columns for which I need dummy coding
    categories = {'gender': ['1', '2'], ..}

    # there are actually two different dataframes that I need to do this for;
    # this just specifies which one I'm looking at, and grabs the relevant
    # features from a config file
    if table == 'users':
        iter_over = self.config.dyadic_features_to_include
    elif table == 'activity':
        iter_over = self.config.user_features_to_include

    def _build_feature_arr(row):
        result = []
        row = row.asDict()
        for …
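The truncated inner loop could be completed along these lines — a plain-Python sketch over row dicts, where the `categories` mapping, feature names, and sample rows are toy assumptions rather than the asker's actual data. In real pyspark code the inner function would receive a Row (hence the `row.asDict()` call above) and could be applied via `map`, or wrapped as a UDF with `pyspark.sql.functions.udf`:

```python
# Sketch: for each feature in iter_over, emit one 0/1 indicator per known
# category (dummy coding). `categories` and the sample row are toy assumptions.
categories = {'gender': ['1', '2'], 'segment': ['a', 'b', 'c']}

def build_feature_arr(iter_over):
    def _build_feature_arr(row):
        result = []
        for feature in iter_over:
            value = row.get(feature)
            for category in categories[feature]:
                result.append(1 if value == category else 0)
        return result
    return _build_feature_arr

fn = build_feature_arr(['gender', 'segment'])
fn({'gender': '2', 'segment': 'a'})   # -> [0, 1, 1, 0, 0]
```

Returning the closure from the outer method keeps `iter_over` and `categories` baked in, so the inner function only needs the row, which is the shape `rdd.map` (or a UDF) expects.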
I'd like to know whether there is a concise way to run ML (e.g. KMeans) on a DataFrame in pyspark when I have the features in multiple numeric columns.

I.e., as in the Iris dataset:
(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
I'd like to use KMeans without recreating the dataset by manually adding the features vector as a new column, with the original columns hard-coded and repeated in the code.

The solution I'd like to improve:
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row
from pyspark.ml.clustering import KMeans, KMeansModel

iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)

df = iris.map(lambda r: Row(
    id=r.id,
    a1=r.a1,
    a2=r.a2,
    a3=r.a3,
    a4=r.a4,
    label=r.label,
    binomial_label=r.binomial_label,
    features=Vectors.dense(r.a1, r.a2, r.a3, r.a4))
).toDF()
kmeans_estimator = KMeans()\
    .setFeaturesCol("features")\
    .setPredictionCol("prediction")
kmeans_transformer = kmeans_estimator.fit(df)
predicted_df = kmeans_transformer.transform(df).drop("features")
…

I want to run a random forest algorithm on Pyspark. The Pyspark documentation mentions that VectorAssembler accepts only numeric or boolean data types. So, if my data contains StringType variables, say the names of cities, should I one-hot encode them before going further with random forest classification/regression?
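Both of the last two questions hinge on collecting several columns into a single features vector, which in pyspark.ml is exactly what `VectorAssembler(inputCols=[...], outputCol="features")` does — no per-column hand-coding of the `Row(...)` constructor needed. A plain-Python sketch of the assembly step itself, over toy row dicts (the column names and values here are assumptions):

```python
# What VectorAssembler does conceptually: add one vector-valued column built
# from a list of input columns, leaving the other columns untouched.
# Rows and column names are toy assumptions, not the asker's data.
feature_cols = ["a1", "a2", "a3", "a4"]

rows = [
    {"id": "id_1", "a1": 5.1, "a2": 3.5, "a3": 1.4, "a4": 0.2},
    {"id": "id_2", "a1": 4.9, "a2": 3.0, "a3": 1.4, "a4": 0.2},
]

def assemble(row, input_cols, output_col="features"):
    out = dict(row)
    out[output_col] = [float(row[c]) for c in input_cols]
    return out

assembled = [assemble(r, feature_cols) for r in rows]
assembled[0]["features"]   # -> [5.1, 3.5, 1.4, 0.2]
```

In actual pyspark this would be roughly `VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)`, after which an estimator such as `KMeans().setFeaturesCol("features")` can be fit directly on the result.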
Here is the code I have been trying; the input file is here:
train = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('filename')
drop_list = ["Country", "Carrier", "TrafficType", "Device", "Browser", "OS", "Fraud", "ConversionPayOut"]

from pyspark.sql.types import DoubleType
train = train.withColumn("ConversionPayOut", train["ConversionPayOut"].cast("double"))  # only this variable is actually double; the rest are strings

junk = train.select([column for column in train.columns if column in drop_list])
transformed = assembler.transform(junk)
I keep getting the error message IllegalArgumentException: u'Data type StringType is not supported.'
PS: Sorry for asking such a basic question; I come from an R background. In R, when we do random forests, there is no need to convert categorical variables into numeric variables.
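The error comes from feeding string columns straight into VectorAssembler. In pyspark.ml the usual fix is StringIndexer (string → numeric index, most frequent value first), optionally followed by OneHotEncoder (index → 0/1 vector), before assembling. A plain-Python sketch of the semantics of those two steps — the sample values are a toy assumption, not the asker's file:

```python
# StringIndexer semantics: map each distinct string to an index, ordered by
# descending frequency (ties broken by value here, for determinism).
# OneHotEncoder semantics: turn an index into a 0/1 indicator vector.
from collections import Counter

values = ["IN", "US", "IN", "UK", "US", "IN"]

freq = Counter(values)
ordering = sorted(freq, key=lambda v: (-freq[v], v))
index_of = {v: i for i, v in enumerate(ordering)}

indexed = [index_of[v] for v in values]

def one_hot(i, size):
    return [1 if j == i else 0 for j in range(size)]

encoded = [one_hot(i, len(index_of)) for i in indexed]

index_of     # -> {'IN': 0, 'US': 1, 'UK': 2}
encoded[3]   # -> [0, 0, 1]   (the single 'UK' row)
```

In real pyspark code this would be roughly `StringIndexer(inputCol="Country", outputCol="CountryIndex")` followed by `OneHotEncoder(inputCol="CountryIndex", outputCol="CountryVec")`, with the resulting columns passed to VectorAssembler. Note that, much as in R, tree-based models like RandomForestClassifier can often consume the indexed column directly (treating it as categorical, e.g. via VectorIndexer's `maxCategories`), so one-hot encoding is not always required for random forests.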