Weighted sampling with PySpark

Question

Xin*_*ang 6 python sampling apache-spark pyspark

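I have an imbalanced DataFrame in Spark, which I work with through PySpark. I want to resample it so that it becomes balanced. The only sampling function I have found in PySpark is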

sample(withReplacement, fraction, seed=None)


But I want to sample the DataFrame with weights based on unit volume. In Python I can do it like this:

df.sample(n, False, weights=log(unitvolume))


Is there any way to do the same thing with PySpark?

Answer 1

hi-*_*zir 3

Spark provides tools for stratified sampling, but these work only on categorical data. You could try bucketizing the column:

from pyspark.ml.feature import Bucketizer
from pyspark.sql.functions import col, log

df_log = df.withColumn("log_unitvolume", log(col("unitvolume")))
splits = ... # A list of splits

bucketizer = Bucketizer(splits=splits, inputCol="log_unitvolume", outputCol="bucketed_log_unitvolume")

df_log_bucketed = bucketizer.transform(df_log)

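The `splits` list is left open in the answer. As one possible illustration (not the answerer's choice), the sketch below builds equal-width split points in plain Python. The helper name `equal_width_splits` and the inputs `lo` and `hi` are hypothetical; they are assumed to be the observed minimum and maximum of `log_unitvolume`, which you would have to compute yourself. The outer edges are set to negative and positive infinity because `Bucketizer` expects strictly increasing splits and needs infinite end points to cover values outside the inner range.

```python
import math


def equal_width_splits(lo, hi, n_buckets):
    """Return n_buckets + 1 strictly increasing split points.

    Hypothetical helper: lo and hi are assumed to be the observed
    min and max of the column being bucketized.
    """
    if n_buckets < 2:
        # Bucketizer needs at least three split points, i.e. two buckets.
        raise ValueError("n_buckets must be at least 2")
    if not lo < hi:
        raise ValueError("lo must be strictly smaller than hi")
    width = (hi - lo) / n_buckets
    # Inner boundaries only; the outer edges become -inf / +inf so that
    # values outside [lo, hi] still fall into the first or last bucket.
    inner = [lo + i * width for i in range(1, n_buckets)]
    return [-math.inf] + inner + [math.inf]


splits = equal_width_splits(0.0, 4.0, 4)
```

Equal-width buckets are only one option; quantile-based boundaries may suit a skewed column better.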

Compute the statistics:

# The bucket column exists only on the bucketized DataFrame
counts = df_log_bucketed.groupBy("bucketed_log_unitvolume").count()
fractions = ...  # Define the sampling fraction for each bucket

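How to define `fractions` is left open in the answer. One possible rule, shown here only as a sketch, is to downsample every bucket toward the size of the smallest one. The helper `balancing_fractions` is hypothetical, and it assumes the per-bucket counts have already been collected to the driver as a plain dict, for example from the rows of a `groupBy(...).count()` result.

```python
def balancing_fractions(bucket_counts):
    """Map each bucket to smallest_count / bucket_count.

    Hypothetical helper: bucket_counts is assumed to be a dict of
    {bucket_id: row_count} collected from Spark beforehand.
    """
    # Ignore empty buckets so they cannot cause a division by zero.
    nonempty = {b: c for b, c in bucket_counts.items() if c > 0}
    if not nonempty:
        raise ValueError("no non-empty buckets")
    smallest = min(nonempty.values())
    # Every fraction lies in (0, 1]; the smallest bucket is kept whole.
    return {b: smallest / c for b, c in nonempty.items()}


fractions = balancing_fractions({0.0: 100, 1.0: 50, 2.0: 10})
```

Because `sampleBy` samples each row independently, the resulting bucket sizes match only in expectation, not exactly.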

Then use them to sample:

df_log_bucketed.sampleBy("bucketed_log_unitvolume", fractions)


You can also try rescaling log_unitvolume to the [0, 1] range, and then:

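from pyspark.sql.functions import col, rand

# rand() is uniform on [0, 1), so as written each row is kept with
# probability 1 - log_unitvolume_rescaled
df_log_rescaled.where(col("log_unitvolume_rescaled") < rand())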

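To make the semantics of the rescale-and-filter idea concrete, here is a small plain-Python sketch of the same mechanics, without Spark. The helper names `min_max_rescale` and `bernoulli_filter` are hypothetical. Note one deliberate difference: the sketch keeps a row when the random draw is below its rescaled weight, so larger weights are more likely to survive, whereas the filter above, as written, compares in the opposite direction. Which direction you want depends on whether large values should be favored or thinned out.

```python
import random


def min_max_rescale(values):
    """Linearly rescale values to [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate case: all values equal, so no ordering to preserve.
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def bernoulli_filter(rows, weights, seed=None):
    """Keep each row independently with probability equal to its weight."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1): weight 0 never passes,
    # weight 1 always passes.
    return [row for row, w in zip(rows, weights) if rng.random() < w]


weights = min_max_rescale([1.0, 2.0, 3.0])
kept = bernoulli_filter(["a", "b", "c"], weights, seed=0)
```

This is an independent per-row decision, so the sample size is random (its expectation is the sum of the weights) rather than a fixed `n` as in the call from the question.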

Archived:	8 years ago
Views:	6377
Last activity:	2 years ago