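I have 7 classes with 115 records in total, and I want to train a random forest model on this data. However, there is not enough data to reach high accuracy. So I want to oversample all of the classes, so that the majority class gets a higher count and the minority classes are raised correspondingly. Is this possible in PySpark?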
+---------+-----+
| SubTribe|count|
+---------+-----+
|    Chill|   10|
|     Cool|   18|
|Adventure|   18|
|    Quirk|   13|
|  Mystery|   25|
|    Party|   18|
|Glamorous|   13|
+---------+-----+
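One possible approach, sketched under assumptions: sample each class with replacement using `DataFrame.sample`, then union the parts. The helper names `sampling_fractions` and `oversample`, the `target_per_class` parameter, and the default seed are hypothetical and not from the question. Sampling with replacement only gives approximate counts, so the result will be close to the target rather than exactly equal to it.

```python
from functools import reduce


def sampling_fractions(counts, target_per_class):
    """Return, per class, the fraction to pass to DataFrame.sample.

    A fraction above 1.0 is allowed when sampling with replacement and
    means each row is expected to appear more than once.
    """
    return {label: target_per_class / n for label, n in counts.items()}


def oversample(df, label_col, target_per_class, seed=42):
    """Oversample every class of `df` towards roughly `target_per_class` rows."""
    # Imported lazily so sampling_fractions can be used without Spark installed.
    from pyspark.sql import functions as F

    # Collect the per-class counts to the driver (cheap: one row per class).
    counts = {
        row[label_col]: row["count"]
        for row in df.groupBy(label_col).count().collect()
    }
    fractions = sampling_fractions(counts, target_per_class)

    # Sample each class separately with replacement, then stitch them together.
    parts = [
        df.filter(F.col(label_col) == label)
          .sample(withReplacement=True, fraction=frac, seed=seed)
        for label, frac in fractions.items()
    ]
    return reduce(lambda a, b: a.unionByName(b), parts)
```

With the counts above, a call such as `oversample(df, "SubTribe", 100)` would use a fraction of 10.0 for `Chill` and 4.0 for `Mystery`. Oversampling only repeats existing rows, so it is usually applied to the training split alone, after the test rows have been set aside.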
In a Spark DataFrame, I have one column whose rows each contain a list of lists. I want to merge the lists of strings into a single list.
INPUT DATAFRAME:
+-------+--------------------+
| name |friends |
+-------+--------------------+
| Jim |[["C","A"]["B","C"]]|
+-------+--------------------+
| Bill |[["E","A"]["F","L"]]|
+-------+--------------------+
| Kim |[["C","K"]["L","G"]]|
+-------+--------------------+
OUTPUT DATAFRAME:
+-------+--------------------+
| name |friends |
+-------+--------------------+
| Jim |["C","A","B"] |
+-------+--------------------+
| Bill |["E","A","F","L"] |
+-------+--------------------+
| Kim |["C","K","L","G"] |
+-------+--------------------+
I want to merge the list of lists into a single list and remove the duplicates. Thanks in advance.
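A minimal sketch of one way to do this, assuming the `friends` column has type `array<array<string>>` and Spark 2.4 or later, where `flatten` and `array_distinct` are available in `pyspark.sql.functions`. The helper names `flatten_distinct` and `merge_friends` are hypothetical; the first is a plain-Python reference for checking the expected result on a small sample, the second applies the same idea to the DataFrame.

```python
def flatten_distinct(nested):
    """Plain-Python reference: flatten a list of lists, keeping first occurrences."""
    seen = []
    for inner in nested:
        for item in inner:
            if item not in seen:
                seen.append(item)
    return seen


def merge_friends(df, col_name="friends"):
    """Replace a nested array column with its flattened, de-duplicated form."""
    # Imported lazily so flatten_distinct can be used without Spark installed.
    from pyspark.sql import functions as F

    # flatten: array<array<string>> -> array<string>
    # array_distinct: drop repeated elements from the flattened array
    return df.withColumn(col_name, F.array_distinct(F.flatten(F.col(col_name))))
```

If the column is actually stored as a string rather than a nested array, it would first need to be parsed (for example with `from_json` and an explicit schema) before `flatten` can be applied.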