Posts by Ric*_*c S

Merge two dicts with the same keys

I have the following two toy dictionaries

d1 = {
 'a': [2,4,5,6,8,10],
 'b': [1,2,5,6,9,12],
 'c': [0,4,5,8,10,21]
 }
d2 = {
 'a': [12,15],
 'b': [14,16],
 'c': [23,35]
  }

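I would like to get a single dictionary in which the values of the second dictionary are stacked after the values of the first, within the same square brackets.

I tried the following code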

d_comb = {key:[d1[key], d2[key]] for key in d1}

but the output I get has two lists inside the list for each key, i.e.

{'a': [[2, 4, 5, 6, 8, 10], [12, 15]],
 'b': [[1, 2, 5, 6, 9, 12], [14, 16]],
 'c': [[0, 4, 5, 8, 10, 21], [23, 35]]}

whereas I would like to obtain

{'a': [2, 4, 5, 6, 8, 10, 12, 15],
 'b': [1, 2, 5, 6, 9, 12, 14, 16],
 'c': [0, …
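One way to get the flat lists described above is to concatenate the two lists for each key with `+` instead of wrapping them in a new list. This is a minimal sketch that assumes both dictionaries have exactly the same keys and that every value is a list, as in the toy data:

```python
# Toy dictionaries from the question.
d1 = {
    'a': [2, 4, 5, 6, 8, 10],
    'b': [1, 2, 5, 6, 9, 12],
    'c': [0, 4, 5, 8, 10, 21],
}
d2 = {
    'a': [12, 15],
    'b': [14, 16],
    'c': [23, 35],
}

# List concatenation (+) builds one flat list per key,
# whereas [d1[key], d2[key]] nests the two lists inside a new one.
d_comb = {key: d1[key] + d2[key] for key in d1}

print(d_comb['a'])  # [2, 4, 5, 6, 8, 10, 12, 15]
```

If a key could be missing from `d2`, `d1[key] + d2.get(key, [])` would avoid a `KeyError`; that case is not part of the question and is mentioned only as a possible variation.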

python dictionary list

Ric*_*c S

2019 01-09

24
Votes

4
Answers

1284
Views

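MultiLabelBinarizer in Spark?

I am looking for a transformer equivalent to MultiLabelBinarizer in sklearn.

So far, all I have found is Binarizer, which does not really do what I need.

I have also looked at this documentation, but I could not see anything that does what I want.

My input is a column in which each element is a list of labels: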

labels    
['a', 'b']
['a']
['c', 'b']
['a', 'c']

The output should be

labels
[1, 1, 0]
[1, 0, 0]
[0, 1, 1]
[1, 0, 1]

What is the PySpark equivalent?
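To make the requested transformation concrete, here is a plain-Python sketch of the multi-hot encoding shown above. It is not a Spark implementation and does not use any PySpark API; the helper name `multi_hot` and the choice to order the label vocabulary alphabetically are assumptions made for illustration only.

```python
# Input rows from the question: each element is a list of labels.
rows = [['a', 'b'], ['a'], ['c', 'b'], ['a', 'c']]

# Build a sorted vocabulary so every label gets a fixed position.
classes = sorted({label for row in rows for label in row})


def multi_hot(labels, classes):
    """Return a 0/1 vector with a 1 at the position of each label present."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]


encoded = [multi_hot(row, classes) for row in rows]

for row, vec in zip(rows, encoded):
    print(row, '->', vec)
```

With the vocabulary `['a', 'b', 'c']`, the four rows encode to the vectors listed in the question. A Spark solution would need to express the same two steps (collect the vocabulary, then map each row to a vector) with DataFrame operations.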

python machine-learning apache-spark pyspark

dis*_*ame

2021 08-23

8
Votes

1
Answers

564
Views

Standardize variables using dplyr [r]

I would like to standardize variables in R. I know several ways to do this. However, I really like the approach below:

library(tidyverse)

df <- mtcars

df %>% 
  gather() %>% 
  group_by(key) %>% 
  mutate(value = value - mean(value)) %>% 
  ungroup() %>% 
  pivot_wider(names_from = key, values_from = value)

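For some reason this approach does not work, because I cannot get the data back into its original format. So I would like to ask for advice.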

r dplyr tidyr tidyverse

Pet*_*etr

2020 07-03

8
Votes

2
Answers

8334
Views

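Calculate the difference between rows in R based on a specific row for each group

Hi everyone, I have a data frame in which each ID has multiple visits, numbered 1-5. I am trying to calculate the difference in score between each visit and visit 1 (score(Visit 5) - score(Visit 1), etc.). How can I achieve this in R? Below are the sample dataset and the result dataset.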

structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B"), 
    Visit = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L), Score = c(16, 
        15, 13, 12, 12, 20, 19, 18)), class = "data.frame", row.names = c(NA, 
    -8L))

#>   ID Visit Score
#> 1  A     1    16
#> 2  A     2    15
#> 3  A     3    13
#> 4  A     4    12
#> 5  A     5    12
#> 6  B     1 …

r data-manipulation dataframe difference dplyr

Dat*_*iac

2021 05-20

6
Votes

1
Answers

623
Views

Query text specifies use_legacy_sql:false, while API options specify:true

I am using standardSQL with bigrquery:

library(bigrquery)
project <- "<project-name>"


sql <- "
#standardSQL
SELECT
<sql-query>;"


result <- query_exec(sql, project = project, useLegacySql = FALSE)

When I run the R script, I get the following error:

 "Error: Query text specifies use_legacy_sql:false, while API options specify:true"

Any idea what might be happening here?

r google-bigquery bigrquery

Ror*_*ton

2019 04-03

5
Votes

1
Answers

1541
Views

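Difference between the distm function and distVincentyEllipsoid in R

Could you fully explain the major differences between using the distm function and the distVincentyEllipsoid function to compute geodesic distances between coordinates in R?

I noticed that this calculation takes longer with distm. Besides the differences, could you explain why that happens?

Thank you!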

r distance

Ant*_*nio

2020 05-29

5
Votes

1
Answers

280
Views

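Dynamically referencing column names in a mutate statement - dplyr

I apologize for the long question, but after a long time I still could not find a solution on my own.

I have this toy data frame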

set.seed(23)
df <- tibble::tibble(
  id = paste0("00", 1:6),
  cond = c(1, 1, 2, 2, 3, 3),
  A_1 = sample(0:9, 6, replace = TRUE), A_2 = sample(0:9, 6, replace = TRUE), A_3 = sample(0:9, 6, replace = TRUE),
  B_1 = sample(0:9, 6, replace = TRUE), B_2 = sample(0:9, 6, replace = TRUE), B_3 = sample(0:9, 6, replace = TRUE),
  C_1 = sample(0:9, 6, replace = TRUE), C_2 = sample(0:9, 6, replace = TRUE), C_3 = sample(0:9, 6, replace = …

r dplyr across

Ric*_*c S

2020 06-24

5
Votes

1
Answers

749
Views

Combining multiple across() calls in a single mutate() statement while controlling variable names in R

I have the following data frame:

df = data.frame(a = 10, b = 20, a_sd = 2, b_sd = 3)

   a  b a_sd b_sd
1 10 20    2    3

I would like to compute a/a_sd and b/b_sd, add the results to the data frame, and name them ratio_a and ratio_b. My real data frame has many variables, so I need a "wide" solution. I tried:

df %>% 
  mutate( across(one_of( c('a','b')))/across(ends_with('_sd')))

This gives:

  a        b a_sd b_sd
1 5 6.666667    2    3

So this works, but the new values replace the old ones. How can I add the results to the data frame and control the new names?

r dplyr tidyverse mutate across

Rti*_*ist

2020 11-26

4
Votes

1
Answers

48
Views

Create a dynamic group by

df = data.frame(
  A = c(1, 4, 5, 13, 2),
  B = c("Group 1", "Group 3", "Group 2", "Group 1", "Group 2"),
  C = c("Group 3", "Group 2", "Group 1", "Group 2", "Group 3")
)

df %>%
  group_by(B) %>%
  summarise(val = mean(A))

df %>%
  group_by(C) %>%
  summarise(val = mean(A))

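Rather than writing a new block of code for each unique group_by, I would like to create a loop that iterates over the df data frame and saves the results to a list or a data frame.

I would like to see how the mean of feature A is distributed across features B and C, without having to write a new block of code for each categorical feature in the dataset.

I tried this: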

List_Of_Groups <- map_df(df, function(i) {
  df %>% 
    group_by(!!!syms(names(df)[1:i])) %>% 
    summarize(newValue = mean(A))
})

r dataframe dplyr purrr tidyverse

Lon*_*car

2020 06-23

3
Votes

1
Answers

142
Views

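Unable to make predictions with a sklearn model on a PySpark DataFrame

I have successfully loaded a sklearn model, but I am unable to make predictions on a PySpark DataFrame. When I run the code given below, I get the error shown below. Please help me with code for making predictions with a sklearn model on PySpark. I have also searched for related questions but did not find a solution.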

sc = spark.sparkContext
braodcast_model = sc.broadcast(loaded_model)
braodcast_model.value


#update prediction method
def predictor(cols):
    #call predict method for model
    return model.value.predict(*cols)

udf_predictor = udf(predictor, FloatType())

#apply the udf to dataframe
df_prediction = df.withColumn("prediction", udf_predictor(df.select(list_of_columns)))

I get the following error message

TypeError: Invalid argument, not a string or column. For column literals, use 'lit', 'array',
'struct' or 'create_map' function.

python machine-learning prediction scikit-learn pyspark

Muh*_*bir

2022 03-17

3
Votes

1
Answers

3441
Views

Tag statistics

r ×7

dplyr ×5

python ×3

tidyverse ×3

across ×2

dataframe ×2

machine-learning ×2

pyspark ×2

apache-spark ×1

bigrquery ×1

data-manipulation ×1

dictionary ×1

difference ×1

distance ×1

google-bigquery ×1

list ×1

mutate ×1

prediction ×1

purrr ×1

scikit-learn ×1

tidyr ×1
