PySpark: reversing StringIndexer in a nested array

Question

Dan*_*ero 6 python apache-spark apache-spark-sql pyspark apache-spark-ml

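I'm using PySpark to do collaborative filtering with ALS. My original user and item IDs are strings, so I used StringIndexer to convert them to numeric indices (PySpark's ALS model requires this).

After fitting the model, I can get the top 3 recommendations for each user like this: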

recs = (
    model
    .recommendForAllUsers(3)
)

The recs dataframe looks like this:

+-----------+--------------------+
|userIdIndex|     recommendations|
+-----------+--------------------+
|       1580|[[10096,3.6725707...|
|       4900|[[10096,3.0137873...|
|       5300|[[10096,2.7274625...|
|       6620|[[10096,2.4493625...|
|       7240|[[10096,2.4928937...|
+-----------+--------------------+
only showing top 5 rows

root
 |-- userIdIndex: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- productIdIndex: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)

I want to create a huge JSON dump from this dataframe, which I can do like this:

(
    recs
    .toJSON()
    .saveAsTextFile("name_i_must_hide.recs")
)

An example of these JSON records:

{
  "userIdIndex": 1580,
  "recommendations": [
    {
      "productIdIndex": 10096,
      "rating": 3.6725707
    },
    {
      "productIdIndex": 10141,
      "rating": 3.61542
    },
    {
      "productIdIndex": 11591,
      "rating": 3.536216
    }
  ]
}

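The userIdIndex and productIdIndex keys are a result of the StringIndexer transformation.

How can I get back the original values of these columns? I suspect I have to use the IndexToString transformer, but I can't quite figure out how, since the data is nested inside an array within the recs dataframe.

I tried using a Pipeline estimator (stages=[StringIndexer, ALS, IndexToString]), but it looks like this estimator doesn't support these indexers.

Cheers!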

Answer 1

zer*_*323 5

In both cases you'll need access to the list of labels. You can get it either from a StringIndexerModel:

user_indexer_model = ...  # type: StringIndexerModel
user_labels = user_indexer_model.labels

product_indexer_model = ...  # type: StringIndexerModel
product_labels = product_indexer_model.labels

or from the column metadata.

For userIdIndex you can just apply IndexToString:
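
As a rough illustration of the column-metadata route: to my knowledge, Spark ML stores the labels of an indexed column in the field metadata under `ml_attr` / `vals`, and PySpark exposes that metadata as a plain dict via `df.schema[name].metadata`. The sketch below assumes that layout. The helper name `labels_from_metadata` and the sample dict are made up for illustration, and the dict stands in for the metadata of the dataframe produced by the indexer (the `recs` dataframe returned by ALS may not carry it).

```python
from typing import Any, List, Mapping


def labels_from_metadata(metadata: Mapping[str, Any]) -> List[str]:
    """Return the label list stored in an indexed column's metadata."""
    # Assumption: nominal attributes live under "ml_attr" -> "vals",
    # ordered so that position i holds the label for index i.
    return list(metadata["ml_attr"]["vals"])


# Hand-written stand-in for indexed_df.schema["userIdIndex"].metadata,
# where indexed_df is a hypothetical name for the indexer's output.
example_metadata = {
    "ml_attr": {
        "type": "nominal",
        "name": "userIdIndex",
        "vals": ["u_b", "u_a", "u_c"],
    }
}

user_labels_from_metadata = labels_from_metadata(example_metadata)
```

With a real dataframe the call would presumably look like `labels_from_metadata(indexed_df.schema["userIdIndex"].metadata)`.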

from pyspark.ml.feature import IndexToString

user_id_to_label = IndexToString(
    inputCol="userIdIndex", outputCol="userId", labels=user_labels)
user_id_to_label.transform(recs)

For recommendations you'll need either a udf or an expression like this:

from pyspark.sql.functions import array, col, lit, struct

n = 3  # Same as numItems

product_labels_ = array(*[lit(x) for x in product_labels])
recommendations = array(*[struct(
    product_labels_[col("recommendations")[i]["productIdIndex"]].alias("productId"),
    col("recommendations")[i]["rating"].alias("rating")
) for i in range(n)])

recs.withColumn("recommendations", recommendations)
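
The udf alternative mentioned above could look roughly like the sketch below. It is not part of the original answer: `relabel_recommendations` and `make_relabel_udf` are hypothetical names, and the return schema (`productId` as string, `rating` as float) is an assumption based on the schema shown in the question. The lookup itself is plain Python, and pyspark is imported inside the wrapper so the lookup can be tried without a Spark session.

```python
from typing import List, Optional, Sequence, Tuple


def relabel_recommendations(
    recommendations: Optional[Sequence[Tuple[int, float]]],
    labels: Sequence[str],
) -> Optional[List[Tuple[str, float]]]:
    """Replace each product index with its original label; keep the rating."""
    if recommendations is None:
        return None
    # Each element is an (index, rating) pair; Spark Row objects unpack the same way.
    return [(labels[index], float(rating)) for index, rating in recommendations]


def make_relabel_udf(labels: Sequence[str]):
    """Wrap relabel_recommendations in a Spark UDF (pyspark imported lazily)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import (
        ArrayType, FloatType, StringType, StructField, StructType)

    # Assumed output schema: array<struct<productId:string, rating:float>>
    result_type = ArrayType(StructType([
        StructField("productId", StringType()),
        StructField("rating", FloatType()),
    ]))
    captured = list(labels)  # captured in the closure and shipped to executors
    return udf(lambda recs: relabel_recommendations(recs, captured), result_type)
```

It would then be applied as `recs.withColumn("recommendations", make_relabel_udf(product_labels)("recommendations"))`. Unlike the expression version, this does not depend on every user having exactly `n` recommendations, at the cost of the usual Python udf serialization overhead.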
