iGa*_*ada 8 string r dataframe
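For each row of the data, I want to get the minimum and maximum values, as well as the number of years, which are originally stored as character strings. For example, consider the following data: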
df <- data.frame(id = 1:4,
                 yr = c("1543,860,2023",
                        "2019,2018,2006,2007",
                        "1998,2012,2000,2020",
                        "2000"))
Desired output:
id                  yr min_yr max_yr nYears
 1       1543,860,2023    860   2023      3
 2 2019,2018,2006,2007   2006   2019      4
 3 1998,2012,2000,2020   1998   2020      4
 4                2000   2000   2000      1
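You can use strsplit to split the string at each ",".
To iterate over the resulting list, you can use sapply.
as.integer converts the strings to integers. as.numeric might be an alternative here, but since you only have years, integer is probably the better choice.
With range you get the minimum and maximum, and with length the number of years.
t transposes the result, which can then be used to add the columns min_yr, max_yr and nYears to the data.frame using [<-.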
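To make the intermediate steps concrete, here is a minimal sketch that traces them on two of the sample strings. The names `yr`, `parts` and `res` are illustrative only and are not part of the answer.

```r
# Two of the sample strings from the question
yr <- c("1543,860,2023", "2000")

# strsplit returns a list with one character vector per input string
parts <- strsplit(yr, ",")
parts[[1]]   # "1543" "860" "2023"

# For each element: convert to integer, then collect min, max and count
res <- sapply(parts, function(x) c(range(as.integer(x)), length(x)))

# sapply puts each result in a column, so res has 3 rows and 2 columns
dim(res)

# t() turns this into one row per input string: min, max, count
t(res)
```

The transposed matrix has the shape that `[<-` needs to fill three new columns of the data frame in one assignment.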
df[c("min_yr", "max_yr", "nYears")] <-
  t(sapply(strsplit(df$yr, ","), \(x) c(range(as.integer(x)), length(x))))

df
#  id                  yr min_yr max_yr nYears
#1  1       1543,860,2023    860   2023      3
#2  2 2019,2018,2006,2007   2006   2019      4
#3  3 1998,2012,2000,2020   1998   2020      4
#4  4                2000   2000   2000      1

str(df)
#'data.frame': 4 obs. of 5 variables:
# $ id    : int 1 2 3 4
# $ yr    : chr "1543,860,2023" "2019,2018,2006,2007" "1998,2012,2000,2020" "2000"
# $ min_yr: int 860 2006 1998 2000
# $ max_yr: int 2023 2019 2020 2000
# $ nYears: int 3 4 4 1

Basically the same, but not as a one-liner.
. <- lapply(strsplit(df$yr, ","), as.integer) # Split on "," and convert to integer
df$min_yr <- vapply(., min, integer(1))        # Get the minimum value
df$max_yr <- vapply(., max, integer(1))        # Get the maximum value
df$nYears <- lengths(.)                        # Count the years per row

Benchmark for the minimum and maximum only (the original question).
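The solution by @user2974951 fails because it compares the values as character strings, and the one by @akrun expects the same number of years in every row.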
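The character-comparison problem mentioned above can be reproduced on the first sample string. This is a minimal sketch; the name `yrs` is illustrative.

```r
# Split one sample string; the pieces are still character strings
yrs <- strsplit("1543,860,2023", ",")[[1]]

# min/max on character vectors compare lexicographically,
# so "1543" sorts first and "860" sorts last
min(yrs)   # "1543"
max(yrs)   # "860"

# Converting to integer before taking the range gives the numeric extremes
range(as.integer(yrs))   # 860 2023
```

Converting the character results to numeric afterwards does not help, because the wrong elements have already been selected.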
set.seed(42)
df <- data.frame(yr = replicate(1e5, paste(sample(0:2023, sample(1:100, 1)), collapse=",")))

library(stringr) # for Maël
library(dplyr)   # for Stefano Barbi and Maël
library(purrr)   # for Stefano Barbi

bench::mark(check = FALSE,
"Maël" = {df %>%
    rowwise() %>%
    mutate(min_yr = min(as.numeric(str_split_1(yr, ","))),
           max_yr = max(as.numeric(str_split_1(yr, ","))))},
"Allan Cameron" = local({df[c('min_yr', 'max_yr')] <- t(sapply(df$yr, \(x) range(scan(text=x, sep = ',')))); df}),
"Stefano Barbi" = {mutate(df, strsplit(yr, ",") |>
    map(as.numeric) |>
    map(range) |>
    map_dfr(setNames, c("min", "max")))},
user2974951 = local({df$min_yr=as.numeric(unlist(lapply(strsplit(df$yr,","),min)))
    df$max_yr=as.numeric(unlist(lapply(strsplit(df$yr,","),max)))
    df}),
#akrun = local({d1 <- read.csv(text = df$yr, header = FALSE) #Fails
#    df$min_yr <- do.call(pmin, d1)
#    df$max_yr <- do.call(pmax, d1)
#    df}),
GKi = local({df[c("min_yr", "max_yr")] <-
    t(sapply(strsplit(df$yr, ","), \(x) c(range(as.integer(x)))))
    df}),
GKi2 = local({. <- lapply(strsplit(df$yr, ","), as.integer)
    df$min_yr <- vapply(., min, integer(1))
    df$max_yr <- vapply(., max, integer(1))
    df})
)

Result
  expression       min    median itr/s…¹ mem_a…² gc/se…³ n_itr  n_gc total…⁴ result
  <bch:expr>       <bch:> <bch:>   <dbl> <bch:b>   <dbl> <int> <dbl> <bch:t> <list>
1 Maël             6.23s  6.23s    0.161 242.7MB   4.66      1    29 6.23s   <NULL>
2 Allan Cameron    3.74s  3.74s    0.267 867.4MB   6.69      1    25 3.74s   <NULL>
3 Stefano Barbi    3.45s  3.45s    0.290 135.3MB   4.06      1    14 3.45s   <NULL>
4 user2974951      2.59s  2.59s    0.387  88.9MB   0.387     1     1 2.59s   <NULL>
5 GKi              1.44s  1.44s    0.694  95.8MB   1.39      1     2 1.44s   <NULL>
6 GKi2             1.27s  1.27s    0.790  64.2MB   0.790     1     1 1.27s   <NULL>

In this case GKi2 (doing it step by step) appears to be the best in terms of both speed and memory consumption.
Here is a one-liner in base R that also works for any number of years:
df[c('min_yr', 'max_yr')] <- t(sapply(df$yr, \(x) range(scan(text=x, sep = ','))))
Resulting in
df
#>   id                  yr min_yr max_yr
#> 1  1 2000,2009,1999,2022   1999   2022
#> 2  2 2019,2018,2006,2007   2006   2019
#> 3  3 1998,2012,2000,2020   1998   2020
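A possible variation on the scan approach, sketched here as an assumption rather than as part of the answer: `quiet = TRUE` suppresses the "Read n items" message that scan otherwise prints for every row, and vapply with `numeric(2)` checks that each row yields exactly two numbers. The name `rng` is illustrative, and the data is the sample from the question.

```r
# Sample data from the question
df <- data.frame(id = 1:4,
                 yr = c("1543,860,2023",
                        "2019,2018,2006,2007",
                        "1998,2012,2000,2020",
                        "2000"))

# scan parses each string into a numeric vector; range gives c(min, max).
# vapply returns a 2 x nrow(df) matrix: row 1 holds minima, row 2 maxima.
rng <- vapply(df$yr,
              function(x) range(scan(text = x, sep = ",", quiet = TRUE)),
              numeric(2), USE.NAMES = FALSE)

df$min_yr <- rng[1, ]
df$max_yr <- rng[2, ]
df
```

Because scan reads numbers as doubles by default, the new columns are numeric rather than integer.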