I want to do stratified sampling from a DataFrame in PySpark. There is a sampleBy(col, fractions, seed=None) function, but it seems to only support a single column as the strata. Is there a way to use multiple columns as strata?
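For example, with a single column I can do something like the following (just an illustrative sketch, assuming a DataFrame df with an integer ID column and made-up fractions), but I don't see how to pass two columns:

df.sampleBy("ID", fractions={1: 0.3, 2: 0.3, 3: 0.3}, seed=0)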
Based on the answer here, after converting it to Python, I think the answer might look something like this:
# create a DataFrame to use
df = sc.parallelize([(1, 1234, 282), (1, 1396, 179), (2, 8620, 178), (3, 1620, 191), (3, 8820, 828)]).toDF(["ID", "X", "Y"])
# we are going to use the first two columns as our key (strata)
# assign a sampling fraction to each key; you could do something cooler here
# (a variant with non-uniform fractions is sketched after the output below)
fractions = df.rdd.map(lambda x: (x[0], x[1])).distinct().map(lambda x: (x, 0.3)).collectAsMap()
# set up how we want to key the DataFrame
kb = df.rdd.keyBy(lambda x: (x[0], x[1]))
# create a DataFrame after sampling from our newly keyed RDD
# note: if the sample did not return any values you'll get a `ValueError: RDD is empty` error
sampleddf = kb.sampleByKey(False, fractions).map(lambda x: x[1]).toDF(df.columns)
sampleddf.show()
+---+----+---+
| ID| X| Y|
+---+----+---+
| 1|1234|282|
| 1|1396|179|
| 3|1620|191|
+---+----+---+
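Following the "something cooler" comment above, the fractions map does not have to be uniform. Here is a minimal hypothetical variant (the custom_fractions name and the 0.8/0.2 rates are made up) that samples the (1, 1234) stratum more heavily than the rest:

# hypothetical: sample one particular (ID, X) stratum more heavily than the others
custom_fractions = {key: (0.8 if key == (1, 1234) else 0.2) for key in fractions}
kb.sampleByKey(False, custom_fractions).map(lambda x: x[1]).toDF(df.columns).show()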
#other examples
kb.sampleByKey(False, fractions).map(lambda x: x[1]).toDF(df.columns).show()
+---+----+---+
| ID| X| Y|
+---+----+---+
| 2|8620|178|
+---+----+---+
kb.sampleByKey(False, fractions).map(lambda x: x[1]).toDF(df.columns).show()
+---+----+---+
| ID| X| Y|
+---+----+---+
| 1|1234|282|
| 1|1396|179|
+---+----+---+
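As a side note, a similar effect can probably be had without dropping to the RDD API by concatenating the strata columns into a single key column and passing that to DataFrame.sampleBy. This is only a sketch; the "strata" column name, the separator, and the 0.3 fraction are arbitrary choices:

from pyspark.sql import functions as F

# hypothetical alternative: build a composite key column and stratify on it with DataFrame.sampleBy
keyed = df.withColumn("strata", F.concat_ws("_", df["ID"], df["X"]))
strata_fractions = {row["strata"]: 0.3 for row in keyed.select("strata").distinct().collect()}
sampled = keyed.sampleBy("strata", strata_fractions, seed=7).drop("strata")
sampled.show()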
Is this the kind of thing you're looking for?