Posts by Nic*_*ick

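Binding rows of data frames that have some factor columns

I want to create an upgraded version of dplyr::bind_rows that avoids the warning "Unequal factor levels: coercing to character" when the data frames we are trying to combine contain factor columns (possibly alongside non-factor columns). Here is an example: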

df1 <- dplyr::data_frame(age = 1:3, gender = factor(c("male", "female", "female")), district = factor(c("north", "south", "west")))
df2 <- dplyr::data_frame(age = 4:6, gender = factor(c("male", "neutral", "neutral")), district = factor(c("central", "north", "east")))

Then bind_rows_with_factor_columns(df1, df2) should return (without warnings):

dplyr::data_frame(
  age = 1:6,
  gender = factor(c("male", "female", "female", "male", "neutral", "neutral")),
  district = factor(c("north", "south", "west", "central", "north", "east"))
)

This is what I have so far:

bind_rows_with_factor_columns <- function(...) {
  factor_columns <- purrr::map(..., function(df) {
      colnames(dplyr::select_if(df, is.factor))
  }) …
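
The goal described above could be approached roughly as follows. This is a minimal sketch and not the author's truncated implementation: it converts factor columns to character, binds the rows, and then rebuilds the factors. It assumes that a column which is a factor in any input should be a factor in the result, and that alphabetically sorted levels (the default of factor()) are acceptable.

```r
library(dplyr)

# Hypothetical sketch; the function body is an assumption, not the original code.
bind_rows_with_factor_columns <- function(...) {
  dfs <- list(...)

  # Names of columns that are factors in at least one input data frame.
  factor_columns <- unique(unlist(lapply(dfs, function(df) {
    names(df)[vapply(df, is.factor, logical(1))]
  })))

  # Convert every factor column to character so bind_rows() has no levels to reconcile.
  dfs_chr <- lapply(dfs, function(df) {
    df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)
    df
  })

  combined <- bind_rows(dfs_chr)

  # Rebuild the factors over the combined values (levels sorted alphabetically).
  for (col in factor_columns) {
    combined[[col]] <- factor(combined[[col]])
  }
  combined
}
```

Called as bind_rows_with_factor_columns(df1, df2) on the example data, this should give factor columns whose levels are the union of the levels in both inputs.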

r dplyr purrr tidyverse

5 votes · 1 answer · 2888 views

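How to flatten a column of array-of-struct type (as returned by the Spark ML API)?

Maybe it is just because I am not very familiar with the API, but I feel that Spark ML methods often return DataFrames that are unnecessarily hard to work with.

This time it is the ALS model that is tripping me up, specifically the recommendForAllUsers method. Let's reconstruct the type of the DataFrame it returns: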

scala> import org.apache.spark.sql.types._

scala> val arrayType = ArrayType(new StructType().add("itemId", IntegerType).add("rating", FloatType))

scala> val recs = Seq((1, Array((1, .7), (2, .5))), (2, Array((0, .9), (4, .1)))).
  toDF("userId", "recommendations").
  select($"userId", $"recommendations".cast(arrayType))

scala> recs.show()
+------+------------------+
|userId|   recommendations|
+------+------------------+
|     1|[[1,0.7], [2,0.5]]|
|     2|[[0,0.9], [4,0.1]]|
+------+------------------+

scala> recs.printSchema
root
 |-- userId: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- itemId: integer (nullable = …

apache-spark apache-spark-sql apache-spark-ml

5 votes · 2 answers · 4297 views

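Merging two vectors using the tidyverse

I have the following three values. What I want is a vector of the same length as lookup, in which the NA elements take the value of replace and the remaining elements take the value of data at the corresponding position. For example: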

lookup = c(NA, NA, 1, 2, NA, 3, NA)
data = c("user_3", "user_4", "user_6")
replace = "no_user_data"

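The desired output would then be: c("no_user_data", "no_user_data", "user_3", "user_4", "no_user_data", "user_6", "no_user_data")

The key constraint here is that I want to use the tidyverse as much as possible. My current solution is as follows: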

data <- c(data, replace)
lookup[is.na(lookup)] <- length(data)
data <- data[lookup]

I am sure this could look much nicer with some tidyverse magic. Thanks!
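
One possible tidyverse-flavoured sketch of the requested behaviour, offered as an illustration rather than as the accepted answer: indexing data by lookup yields NA wherever lookup is NA, and dplyr::coalesce then fills those gaps with replace. The variable names are the ones from the question; result is a name introduced here for illustration.

```r
library(dplyr)

lookup <- c(NA, NA, 1, 2, NA, 3, NA)
data <- c("user_3", "user_4", "user_6")
replace <- "no_user_data"

# data[lookup] is NA at every position where lookup is NA;
# coalesce() substitutes the fallback value at exactly those positions.
result <- coalesce(data[lookup], replace)
```

tidyr::replace_na(data[lookup], replace) would be an equivalent alternative if tidyr is preferred.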

r dplyr tidyr tidyverse

1 vote · 1 answer · 765 views