How to find the max and min values of a string row of a data frame in R?

iGa*_*ada 8 string r dataframe

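For each row of my data, I want to get the minimum, the maximum, and the number of years, which are originally stored as a character string. For example, consider the following data: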

df <- data.frame(id = 1:4,
                 yr = c("1543,860,2023",
                        "2019,2018,2006,2007",
                        "1998,2012,2000,2020",
                        "2000"))

Desired output:

id                   yr  min_yr  max_yr  nYears
 1        1543,860,2023     860    2023       3
 2  2019,2018,2006,2007    2006    2019       4
 3  1998,2012,2000,2020    1998    2020       4
 4                 2000    2000    2000       1

GKi*_*GKi 8

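You can use strsplit to split the string at the commas.
To iterate over the resulting list, you can use sapply.
as.integer converts the strings to integers. as.numeric would be an alternative here, but since you only have years, integer is probably the better choice.
With range you get the min and max, and with length the number of years.
t transposes the result, which can then be used to add the columns min_yr, max_yr and nYears to the data.frame using [<-.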
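To make the individual steps easier to follow, here is a minimal sketch that runs them one at a time on two of the example strings. The intermediate names (parts, nums, stats, res) are only illustrative and are not part of the answer's code.

```r
yr <- c("1543,860,2023", "2000")

parts <- strsplit(yr, ",")          # list of character vectors, one per row
nums  <- lapply(parts, as.integer)  # same list, now integer vectors

# sapply() simplifies to a matrix with one COLUMN per input row
stats <- sapply(nums, function(x) c(range(x), length(x)))

# t() flips it so that each input row becomes one matrix row: min, max, count
res <- t(stats)
colnames(res) <- c("min_yr", "max_yr", "nYears")
res
```

Because sapply returns one column per element, the transpose is what lines the result up with the rows of the data frame.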

df[c("min_yr", "max_yr", "nYears")] <-
  t(sapply(strsplit(df$yr, ","), \(x) c(range(as.integer(x)), length(x))))

df
#  id                  yr min_yr max_yr nYears
#1  1       1543,860,2023    860   2023      3
#2  2 2019,2018,2006,2007   2006   2019      4
#3  3 1998,2012,2000,2020   1998   2020      4
#4  4                2000   2000   2000      1

str(df)
#'data.frame':   4 obs. of  5 variables:
# $ id    : int  1 2 3 4
# $ yr    : chr  "1543,860,2023" "2019,2018,2006,2007" "1998,2012,2000,2020" "2000"
# $ min_yr: int  860 2006 1998 2000
# $ max_yr: int  2023 2019 2020 2000
# $ nYears: int  3 4 4 1

Basically the same, but done step by step instead of in one go.

. <- lapply(strsplit(df$yr, ","), as.integer) #Split it by "," and convert it to integer
df$min_yr <- vapply(., min, integer(1)) #Get the minimum value
df$max_yr <- vapply(., max, integer(1)) #Get the maximum value
df$nYears <- lengths(.) #Get the number of years

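Benchmark for min and max only (the original question).
The answer from @user2974951 gives wrong results because min and max compare the values as character strings, and the answer from @akrun expects the same number of years in every row.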

set.seed(42)
df <- data.frame(yr = replicate(1e5, paste(sample(0:2023, sample(1:100, 1)), collapse=",")))

library(stringr) #for Maël
library(dplyr)  #for Stefano Barbi and Maël
library(purrr)  #for Stefano Barbi

bench::mark(check = FALSE,
"Maël" = {df %>%
            rowwise() %>%
            mutate(min_yr = min(as.numeric(str_split_1(yr, ","))),
                   max_yr = max(as.numeric(str_split_1(yr, ","))))},
"Allan Cameron" = local({df[c('min_yr', 'max_yr')] <- t(sapply(df$yr, \(x) range(scan(text=x, sep = ',')))); df}),
"Stefano Barbi" = {mutate(df, strsplit(yr, ",") |>
                          map(as.numeric) |>
                          map(range) |>
                          map_dfr(setNames, c("min", "max")))},
user2974951 = local({df$min_yr=as.numeric(unlist(lapply(strsplit(df$yr,","),min)))
  df$max_yr=as.numeric(unlist(lapply(strsplit(df$yr,","),max)))
  df}),
#akrun = local({d1 <- read.csv(text = df$yr, header = FALSE) #Fails
#  df$min_yr <- do.call(pmin, d1)
#  df$max_yr <- do.call(pmax, d1)
#  df}),
GKi = local({df[c("min_yr", "max_yr")] <-
               t(sapply(strsplit(df$yr, ","), \(x) c(range(as.integer(x)))))
            df}),
GKi2 = local({. <- lapply(strsplit(df$yr, ","), as.integer)
  df$min_yr <- vapply(., min, integer(1))
  df$max_yr <- vapply(., max, integer(1))
  df})
)

Result

  expression       min median itr/s…¹ mem_a…² gc/se…³ n_itr  n_gc total…⁴ result
  <bch:expr>    <bch:> <bch:>   <dbl> <bch:b>   <dbl> <int> <dbl> <bch:t> <list>
1 Maël           6.23s  6.23s   0.161 242.7MB   4.66      1    29   6.23s <NULL>
2 Allan Cameron  3.74s  3.74s   0.267 867.4MB   6.69      1    25   3.74s <NULL>
3 Stefano Barbi  3.45s  3.45s   0.290 135.3MB   4.06      1    14   3.45s <NULL>
4 user2974951    2.59s  2.59s   0.387  88.9MB   0.387     1     1   2.59s <NULL>
5 GKi            1.44s  1.44s   0.694  95.8MB   1.39      1     2   1.44s <NULL>
6 GKi2           1.27s  1.27s   0.790  64.2MB   0.790     1     1   1.27s <NULL>

In this case GKi2 (doing it step by step) appears to be the best in terms of both speed and memory consumption.
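The remark above the benchmark about character comparison can be illustrated with a short sketch on the first example row. It shows why min and max need to run on converted numbers rather than on the raw pieces returned by strsplit; the variable name x is only illustrative.

```r
x <- strsplit("1543,860,2023", ",")[[1]]

# On character vectors, min() and max() compare strings character by character,
# so "1543" sorts before "860" because "1" comes before "8"
min(x)   # "1543"
max(x)   # "860"

# Converting first gives the numeric answer
min(as.integer(x))   # 860
max(as.integer(x))   # 2023
```

Wrapping the character result in as.numeric afterwards, as in the user2974951 variant, does not help, because the wrong element has already been picked.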



All*_*ron 5

Here is a one-liner in base R that also works for any numbers, not only integers.

df[c('min_yr', 'max_yr')] <- t(sapply(df$yr, \(x) range(scan(text=x, sep = ','))))

Resulting in

df
#>   id                  yr min_yr max_yr
#> 1  1 2000,2009,1999,2022   1999   2022
#> 2  2 2019,2018,2006,2007   2006   2019
#> 3  3 1998,2012,2000,2020   1998   2020
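As a hedged illustration of the "any numbers" claim, the sketch below applies the same scan-based idea to made-up decimal and negative values. The data frame vals and its column names are hypothetical. quiet = TRUE is added to suppress scan's "Read N items" messages, and USE.NAMES = FALSE keeps the input strings from becoming row names of the matrix.

```r
# Hypothetical data with decimals and a negative value
vals <- data.frame(id = 1:3, x = c("1.5,-2,10", "0.25", "3,3.5"))

vals[c("min_x", "max_x")] <- t(sapply(
  vals$x,
  function(s) range(scan(text = s, sep = ",", quiet = TRUE)),
  USE.NAMES = FALSE
))

vals
```

scan reads doubles by default, so the resulting columns are numeric rather than integer.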