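I want to count the occurrences of strings per group in a column. In this case, the strings are usually substrings within a character column.
I have some data, for example: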
ID String village
1 fd_sec, ht_rm, A
2 NA, ht_rm A
3 fd_sec, B
4 san, ht_rm, C
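The code I started with is obviously incorrect, but my searches have not turned up a way to use a grep-style function on the column together with grouping by village: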
impacts <- se %>% group_by(village) %>%
summarise(c_NA = round(sum(sub$en41_1 == "NA")),
c_ht_rm = round(sum(sub$en41_1 == "ht_rm")),
c_san = round(sum(sub$en41_1 == "san")),
c_fd_sec = round(sum(sub$en41_1 == "fd_sec")))
Ideally, my output would be:
village fd_sec NA ht_rm san
A 1 1 2
B 1
C 1 1
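Thanks in advance.
We can do this in base R: split the 'String' column by 'village' with split, break each string into substrings with strsplit at a comma followed by zero or more spaces (",\\s*"), stack the resulting list into a two-column data.frame, and get the frequencies with table.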
table(stack(lapply(split(df1$String, df1$village),
function(x) unlist(strsplit(x, ",\\s*"))))[2:1])
# values
#ind fd_sec ht_rm NA san
# A 1 2 1 0
# B 1 0 0 0
# C 0 1 0 1
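The base R one-liner can be easier to follow when unrolled step by step. The following is a minimal sketch, assuming the example data is stored in a data.frame named df1 whose String column is character rather than factor; the data construction and the intermediate names (by_village, tokens, long, counts) are illustrative and not part of the original answer.

```r
# Hypothetical reconstruction of the example data. The exact strings and
# stringsAsFactors = FALSE are assumptions based on the question's table.
df1 <- data.frame(
  ID = 1:4,
  String = c("fd_sec, ht_rm,", "NA, ht_rm", "fd_sec,", "san, ht_rm,"),
  village = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# 1. One character vector of raw strings per village
by_village <- split(df1$String, df1$village)

# 2. Split each string at a comma plus optional whitespace, then flatten
#    the pieces for each village into a single character vector
tokens <- lapply(by_village, function(x) unlist(strsplit(x, ",\\s*")))

# 3. Two-column data.frame: 'values' holds the token, 'ind' the village
long <- stack(tokens)

# 4. Cross-tabulate village (rows) against token (columns)
counts <- table(long$ind, long$values)
print(counts)
```

Note that "NA" here is the literal string, not a missing value, which is why it gets its own column in the table.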
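Or, using the tidyverse: after grouping by 'village', split 'String' with separate_rows, which reshapes it to 'long' format, filter out the rows that have blank values in 'String', count the frequencies, and spread the result to 'wide' format.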
library(dplyr)
library(tidyr)
df1 %>%
group_by(village) %>%
separate_rows(String, sep=",\\s*") %>%
filter(nzchar(String)) %>%
count(village, String) %>%
spread(String, n, fill = 0)
# A tibble: 3 x 5
# Groups: village [3]
# village fd_sec ht_rm `NA` san
#* <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 1.00 2.00 1.00 0
#2 B 1.00 0 0 0
#3 C 0 1.00 0 1.00
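In more recent versions of tidyr, spread is superseded by pivot_wider. The same pipeline might be written as follows; this is a sketch rather than part of the original answer, and it assumes tidyr 1.1.0 or later (where values_fill accepts a scalar) and the same hypothetical df1 as the example data.

```r
library(dplyr)
library(tidyr)

# Hypothetical reconstruction of the example data (an assumption)
df1 <- data.frame(
  ID = 1:4,
  String = c("fd_sec, ht_rm,", "NA, ht_rm", "fd_sec,", "san, ht_rm,"),
  village = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  # one row per token; a trailing comma leaves an empty string behind
  separate_rows(String, sep = ",\\s*") %>%
  # drop those empty tokens
  filter(nzchar(String)) %>%
  # frequency of each token within each village
  count(village, String) %>%
  # one column per token, filling absent combinations with 0
  pivot_wider(names_from = String, values_from = n, values_fill = 0)

print(wide)
```

Because count() already groups by village and String, the explicit group_by(village) step is not needed in this variant.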
You can also use cSplit() from my "splitstackshape" package. Since this package also loads "data.table", you can then simply use dcast() to tabulate the result.
Example:
library(splitstackshape)
cSplit(mydf, "String", direction = "long")[, dcast(.SD, village ~ String)]
# Using 'village' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
# village fd_sec ht_rm san NA
# 1: A 1 2 0 1
# 2: B 1 0 0 0
# 3: C 0 1 1 0