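I have a set of survey responses in which respondents could pick zero or more options to answer the question "What types of fruit do you like?" There was also a space for a write-in answer. In the resulting spreadsheet, each person's response sits in a single cell, with the different types of fruit separated by commas, like this: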
(df <- data.frame(id = c("A", "B", "C", "D", "E"),
data = c("oranges, apples, peaches, cherries, pineapples, strawberries",
"oranges, peaches, pears",
"pears, nectarines, cherries (bing, rainier)",
"apples, peaches, nectarines",
""),
stringsAsFactors = FALSE))
# id data
# 1 A oranges, apples, peaches, cherries, pineapples, strawberries
# 2 B oranges, peaches, pears
# 3 C pears, nectarines, cherries (bing, rainier)
# 4 D apples, peaches, nectarines
# 5 E
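What I want to do is split the responses into a long-format table, which I have almost accomplished with the code at the bottom. However, some respondents included commas in their write-in responses, and I don't want to split those answers at the commas. I know what all the original multiple-choice options are; how can I split out only those answers and leave the write-ins (commas included) intact? I'd like to end up with a data frame like this: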
id data
1 A oranges
2 A apples
3 A peaches
4 A cherries, pineapples, strawberries
5 B oranges
6 B peaches
7 B pears
8 C pears
9 C nectarines
10 C cherries (bing, rainier)
11 D apples
12 D peaches
13 D nectarines
The multiple-choice options were:
mc_answers <- c("oranges", "plums", "apples", "peaches", "pears", "nectarines")
What I've managed so far is:
# use strsplit to create a list of the types of fruit each person likes
datalist <- strsplit(df$data, ", ")
names(datalist) <- df$id
# remove zero-length list elements (person E doesn't like any fruit)
datalist <- Filter(length, datalist)
# convert list elements to data frames
datalist_dfs <- lapply(datalist, data.frame, stringsAsFactors = FALSE)
datalist_dfs <- lapply(datalist_dfs, setNames, "data") # name each column 'data'
# add id column to each data frame
data_long <- mapply(function(x, y) "[<-"(x, "id", value = y), datalist_dfs,
names(datalist_dfs), SIMPLIFY = FALSE)
# combine into one big data frame
(data_per_person <- do.call('rbind', data_long))
# data id
# A.1 oranges A
# A.2 apples A
# A.3 peaches A
# A.4 cherries A # should
# A.5 pineapples A # be one
# A.6 strawberries A # entry
# B.1 oranges B
# B.2 peaches B
# B.3 pears B
# C.1 pears C
# C.2 nectarines C
# C.3 cherries (bing C # should be
# C.4 rainier) C # one entry
# D.1 apples D
# D.2 peaches D
# D.3 nectarines D
There's no rule about how many fruits a person can choose, but if there is a write-in answer, it is always last.
How about something like this:
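do.call(rbind, lapply(split(df, df$id), function(x) {
  # split this respondent's answer on commas (optionally followed by a space)
  v <- unlist(strsplit(x$data, ",\\s?"))
  # keep known choices as-is; glue everything else back into one write-in
  v <- c(v[v %in% mc_answers], paste(v[!v %in% mc_answers], collapse = ", "))
  # drop the empty string left over when there is no write-in
  v <- v[nchar(v) > 0]
  if (length(v) > 0) {
    data.frame(id = x$id[1], data = v, stringsAsFactors = FALSE)
  } else {
    NULL
  }
}))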
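Here we split the data frame so that each respondent is handled separately, then split the response string on commas. Then we collapse all the entries that are not in the mc_answers vector back into a single string. It returns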
id data
A.1 A oranges
A.2 A apples
A.3 A peaches
A.4 A cherries, pineapples, strawberries
B.1 B oranges
B.2 B peaches
B.3 B pears
C.1 C pears
C.2 C nectarines
C.3 C cherries (bing, rainier)
D.1 D apples
D.2 D peaches
D.3 D nectarines
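Since the question says a write-in, when present, is always last, a variant could rely on position instead of membership: keep the leading tokens that are known choices, and treat everything from the first unknown token onward as the write-in. The following is only a sketch under that assumption; `split_response` is a hypothetical helper name, not something from the question or from any package.

```r
# Sample data and multiple-choice options, copied from the question.
df <- data.frame(id = c("A", "B", "C", "D", "E"),
                 data = c("oranges, apples, peaches, cherries, pineapples, strawberries",
                          "oranges, peaches, pears",
                          "pears, nectarines, cherries (bing, rainier)",
                          "apples, peaches, nectarines",
                          ""),
                 stringsAsFactors = FALSE)
mc_answers <- c("oranges", "plums", "apples", "peaches", "pears", "nectarines")

# Hypothetical helper: split one response string into its multiple-choice
# picks plus at most one write-in. Assumes the write-in, if any, comes last.
split_response <- function(response, choices) {
  tokens <- strsplit(response, ",\\s*")[[1]]
  # Position of the first token that is not a known choice (NA if none).
  first_other <- match(FALSE, tokens %in% choices)
  if (is.na(first_other)) {
    return(tokens)  # only known choices, or an empty response
  }
  # Re-join the tail with ", " so the write-in stays a single entry.
  write_in <- paste(tokens[first_other:length(tokens)], collapse = ", ")
  c(tokens[seq_len(first_other - 1)], write_in)
}

# One character vector per respondent, then one row per entry.
parts <- lapply(df$data, split_response, choices = mc_answers)
long <- data.frame(id = rep(df$id, lengths(parts)),
                   data = unlist(parts),
                   stringsAsFactors = FALSE)
long
```

For the sample data this should give the same 13 rows as the desired table in the question, with plain integer row names, and respondent E drops out because the empty response yields no tokens. One difference from the `%in%` approach: a known fruit name that appears inside a write-in (after the first unknown token) stays part of the write-in rather than being pulled out as its own row. Re-joining with ", " also normalizes the spacing after commas inside the write-in.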