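I would like to create an upgraded version of dplyr::bind_rows that avoids the warning "Unequal factor levels: coercing to character" when the dfs we are trying to combine contain factor columns (possibly alongside non-factor columns).
Here is an example: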
df1 <- dplyr::data_frame(age = 1:3, gender = factor(c("male", "female", "female")), district = factor(c("north", "south", "west")))
df2 <- dplyr::data_frame(age = 4:6, gender = factor(c("male", "neutral", "neutral")), district = factor(c("central", "north", "east")))
Then bind_rows_with_factor_columns(df1, df2) should return (without warnings):
dplyr::data_frame(
  age = 1:6,
  gender = factor(c("male", "female", "female", "male", "neutral", "neutral")),
  district = factor(c("north", "south", "west", "central", "north", "east"))
)
Here is what I have so far:
bind_rows_with_factor_columns <- function(...) {
factor_columns <- purrr::map(..., function(df) {
colnames(dplyr::select_if(df, is.factor))
}) …
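One way the desired behavior could be sketched is shown below. This is not a completion of the truncated attempt above; it is a minimal sketch that assumes it is acceptable for the resulting factor levels to be rebuilt in default (sorted) order, as in the expected output. It converts every factor column to character before binding, then turns the affected columns back into factors.

```r
library(dplyr)

# Sketch only: bind data frames whose factor columns have differing levels
# without triggering the "coercing to character" warning.
bind_rows_with_factor_columns <- function(...) {
  dfs <- list(...)

  # Names of columns that are factors in at least one input.
  factor_columns <- unique(unlist(lapply(dfs, function(df) {
    names(df)[vapply(df, is.factor, logical(1))]
  })))

  # Convert factors to character up front so bind_rows has nothing to coerce.
  dfs_chr <- lapply(dfs, function(df) {
    df[] <- lapply(df, function(col) {
      if (is.factor(col)) as.character(col) else col
    })
    df
  })

  out <- dplyr::bind_rows(dfs_chr)

  # Restore factors; levels are the sorted unique values (an assumption).
  for (col in factor_columns) {
    out[[col]] <- factor(out[[col]])
  }
  out
}
```

Note that any custom level ordering in the inputs is lost with this approach; preserving it would need the union of the original levels to be passed to factor() explicitly.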
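
Maybe it is just because I am not very familiar with the API, but I feel that Spark ML methods often return DFs that are unnecessarily hard to work with.
This time it is the ALS model that is tripping me up, specifically the recommendForAllUsers method. Let's reconstruct the type of DF it returns: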
scala> import org.apache.spark.sql.types._
scala> val arrayType = ArrayType(new StructType().add("itemId", IntegerType).add("rating", FloatType))
scala> val recs = Seq((1, Array((1, .7), (2, .5))), (2, Array((0, .9), (4, .1)))).
         toDF("userId", "recommendations").
         select($"userId", $"recommendations".cast(arrayType))
scala> recs.show()
+------+------------------+
|userId| recommendations|
+------+------------------+
| 1|[[1,0.7], [2,0.5]]|
| 2|[[0,0.9], [4,0.1]]|
+------+------------------+
scala> recs.printSchema()
root
 |-- userId: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- itemId: integer (nullable = …
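
I have the following three values. What I want is a vector of the same length as lookup, in which the NA elements take the value of replace and the remaining elements take the value at the corresponding position in data. For example: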
lookup = c(NA, NA, 1, 2, NA, 3, NA)
data = c("user_3", "user_4", "user_6")
replace = "no_user_data"
The desired output would then be: c("no_user_data", "no_user_data", "user_3", "user_4", "no_user_data", "user_6", "no_user_data")
The key constraint here is that I want to make use of the tidyverse as much as possible. My current solution is as follows:
data <- c(data, replace)
lookup[is.na(lookup)] <- length(data)
data <- data[lookup]
I believe this could look much nicer with a bit of tidyverse magic. Thanks!
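One possible tidyverse-flavoured version, offered as a sketch rather than the definitive answer, relies on two facts: indexing a vector with NA yields NA, and dplyr::coalesce() fills NAs from its next argument. The function name fill_lookup is hypothetical and introduced only for illustration.

```r
library(dplyr)

lookup  <- c(NA, NA, 1, 2, NA, 3, NA)
data    <- c("user_3", "user_4", "user_6")
replace <- "no_user_data"

# Sketch: index data by lookup (NA positions stay NA), then fill the NAs
# with the replacement value.
fill_lookup <- function(lookup, data, replace) {
  dplyr::coalesce(data[lookup], replace)
}

result <- fill_lookup(lookup, data, replace)
```

One caveat of this approach: if data itself contained NA values, those would be replaced as well, which the original index-based solution would not do.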