Post by Sur*_*ain

Oversampling or SMOTE in PySpark

I have 7 classes and a total of 115 records, and I want to run a random forest model on this data. However, there is not enough data to get good accuracy, so I would like to oversample all of the classes: the majority class is boosted to a higher count and the minority classes are scaled up accordingly. Is this possible in PySpark?

+---------+-----+
| SubTribe|count|
+---------+-----+
|    Chill|   10|
|     Cool|   18|
|Adventure|   18|
|    Quirk|   13|
|  Mystery|   25|
|    Party|   18|
|Glamorous|   13|
+---------+-----+
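
One straightforward option (plain random oversampling with replacement, not SMOTE) is to sample each class up to roughly the size of the largest class and union the pieces back together. Below is a minimal sketch under the assumption that the data sits in a DataFrame named df with the SubTribe label column shown above; the DataFrame name and the seed are illustrative choices, not part of the original question.

from pyspark.sql import functions as F

# Per-class row counts, e.g. {"Chill": 10, "Cool": 18, ...}
counts = {r["SubTribe"]: r["count"] for r in df.groupBy("SubTribe").count().collect()}
target = max(counts.values())

balanced = None
for label, n in counts.items():
    # With replacement, fraction may exceed 1; each class is resampled to
    # roughly `target` rows (the result is approximate, not an exact count).
    frac = target / n
    part = df.filter(F.col("SubTribe") == label).sample(withReplacement=True, fraction=frac, seed=42)
    balanced = part if balanced is None else balanced.union(part)

balanced.groupBy("SubTribe").count().show()

Because sampling is probabilistic, the per-class counts come out only approximately equal; re-check the counts afterwards if exact balance matters.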

machine-learning random-forest pyspark oversampling

6 votes · 2 answers · 7974 views

How to merge a list of lists into a single list in PySpark

In a Spark DataFrame, I have a column whose rows contain a list of lists. I want to merge the lists of strings into a single list.

INPUT DATAFRAME:
+-------+---------------------+
| name  |friends              |
+-------+---------------------+
| Jim   |[["C","A"],["B","C"]]|
+-------+---------------------+
| Bill  |[["E","A"],["F","L"]]|
+-------+---------------------+
| Kim   |[["C","K"],["L","G"]]|
+-------+---------------------+

OUTPUT DATAFRAME:  

+-------+--------------------+
| name  |friends             |
+-------+--------------------+
| Jim   |["C","A","B"]       |
+-------+--------------------+
| Bill  |["E","A","F","L"]   |
+-------+--------------------+
| Kim   |["C","K","L","G"]   | 
+-------+--------------------+

I would like to merge the list of lists into a single list and remove duplicates. Thanks in advance.
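
With Spark 2.4 or later, flatten plus array_distinct from pyspark.sql.functions does this in one pass: flatten concatenates the inner arrays and array_distinct drops duplicates while keeping first-occurrence order. A minimal sketch, assuming the data is in a DataFrame named df with the columns shown above (the DataFrame name is an assumption):

from pyspark.sql import functions as F

# Flatten the array-of-arrays column, then remove duplicate entries.
merged = df.withColumn("friends", F.array_distinct(F.flatten("friends")))
merged.show(truncate=False)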

dataframe apache-spark pyspark

4 votes · 1 answer · 6801 views