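I think my question is as much about best practice as it is about tidying messy data, so here goes.
Below is an excerpt of the data frame lang.df, a school-wide student dataset. The column Language.Home holds parents' answers to the question: "What languages do you speak at home?"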
> lang.df
   Nationality             Language.Home
1           HK                  Mandarin
2       German   Mandarin/English/German
3        Saudi                    Arabic
4    Norwegian                 Norwegian
5           UK                   English
6           HK Mandarin/ Min Nan dialect
7   Australian                  Mandarin
8           HK                  Mandarin
9    Brazilian        Portuguese/English
10      Indian             Hindi/English
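Clearly this is a poor way to collect this information and a poor way to store it, but my job is to work with the data I have.
Outcome
I want to explore the effect that certain home languages may have on achievement. What I need is the ability to group by a single language spoken at home (for example, students who speak English at home).
To do that, it seems I have to (1) use separate() (from tidyr, loaded alongside dplyr) to split the Language.Home column into three columns ("language.home1", "language.home2", "language.home3"), and (2) create a new column for each unique value (i.e. language) found in those new columns.
Process
Below is my attempt at doing the above: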
library(dplyr)
library(tidyr)

# separate Language.Home into three new columns
lang.df <- lang.df %>%
  separate(Language.Home,
           c("language.home1", "language.home2", "language.home3"),
           sep = "/",
           remove = FALSE)

# find distinct languages & remove NAs
langs <- unique(c(lang.df$language.home1,
                  lang.df$language.home2,
                  lang.df$language.home3))
langs <- langs[!is.na(langs)]

# create boolean column for each unique language in new columns
for (i in langs) {
  lang.df[, paste(i)] <- grepl(i, lang.df$Language.Home)
}
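Question
I have looked through the tidyr documentation and searched here, but could not find anything relevant. Thanks in advance for your help. I have only been using R on and off for about a year, and this is my first SO post. Give me as much feedback as you can!
Data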
lang.df <- structure(list(Nationality = structure(c(4L, 3L, 7L, 6L, 8L,
4L, 1L, 4L, 2L, 5L), .Label = c("Australian", "Brazilian", "German",
"HK", "Indian", "Norwegian", "Saudi", "UK"), class = "factor"),
`Language.Home` = structure(c(4L, 6L, 1L, 7L, 2L, 5L, 4L,
4L, 8L, 3L), .Label = c("Arabic", "English", "Hindi/English",
"Mandarin", "Mandarin/ Min Nan dialect", "Mandarin/English/German",
"Norwegian", "Portuguese/English"), class = "factor")), row.names = c(NA,
10L), .Names = c("Nationality", "Language.Home"), class = "data.frame")
We can use cSplit from splitstackshape to split 'Language.Home' on the delimiter / and convert it to 'long' format.
library(splitstackshape)
library(data.table)
dt <- cSplit(lang.df, "Language.Home", "/", "long")
Then use dcast to convert from 'long' to 'wide':
dcast(dt, Nationality~Language.Home, fun.aggregate = function(x) length(x)>0)
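Note: 'Nationality' contains duplicate values (for example HK), so the call above collapses the rows that share a nationality into one. Combining them this way may be preferable.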
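To make the collapsing concrete, here is a minimal base R sketch that needs no extra packages. The data frame `toy` and the names `parts`, `long` and `wide` are made up for illustration and are not part of the original data. It mimics the long-then-wide idea with `strsplit()` and `table()`, and the two HK rows end up in a single row of the result.

```r
# Hypothetical toy data with a repeated Nationality ("HK")
toy <- data.frame(
  Nationality   = c("HK", "German", "HK"),
  Language.Home = c("Mandarin", "Mandarin/English", "Mandarin/ Min Nan dialect"),
  stringsAsFactors = FALSE
)

# Split each answer on "/" and trim stray spaces around the pieces
parts <- lapply(strsplit(toy$Language.Home, "/", fixed = TRUE), trimws)

# Long format: one row per (Nationality, language) pair
long <- data.frame(
  Nationality = rep(toy$Nationality, lengths(parts)),
  Language    = unlist(parts),
  stringsAsFactors = FALSE
)

# Wide format: TRUE where a nationality/language pair occurs at least once
wide <- table(long$Nationality, long$Language) > 0
wide
```

Because the table is keyed on Nationality alone, `wide` has one row per nationality rather than one row per student.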
If we need the logical columns for each individual row (regardless of repeated 'Nationality' values):
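# setDT() converts lang.df to a data.table by reference and adds an "rn" row-id column
dcast(cSplit(setDT(lang.df, keep.rownames = TRUE), "Language.Home", "/", "long"),
      rn + Nationality ~ Language.Home,
      fun.aggregate = function(x) length(x) > 0)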
Another option is mtabulate from qdapTools, applied after splitting 'Language.Home' on /.
library(qdapTools)
cbind(lang.df, !!(mtabulate(setNames(strsplit(as.character(lang.df$Language.Home),
"/"), lang.df$Nationality))))
#   Nationality             Language.Home  Min Nan dialect Arabic English German Hindi Mandarin Norwegian Portuguese
#1           HK                  Mandarin            FALSE  FALSE   FALSE  FALSE FALSE     TRUE     FALSE      FALSE
#2       German   Mandarin/English/German            FALSE  FALSE    TRUE   TRUE FALSE     TRUE     FALSE      FALSE
#3        Saudi                    Arabic            FALSE   TRUE   FALSE  FALSE FALSE    FALSE     FALSE      FALSE
#4    Norwegian                 Norwegian            FALSE  FALSE   FALSE  FALSE FALSE    FALSE      TRUE      FALSE
#5           UK                   English            FALSE  FALSE    TRUE  FALSE FALSE    FALSE     FALSE      FALSE
#6           HK Mandarin/ Min Nan dialect             TRUE  FALSE   FALSE  FALSE FALSE     TRUE     FALSE      FALSE
#7   Australian                  Mandarin            FALSE  FALSE   FALSE  FALSE FALSE     TRUE     FALSE      FALSE
#8           HK                  Mandarin            FALSE  FALSE   FALSE  FALSE FALSE     TRUE     FALSE      FALSE
#9    Brazilian        Portuguese/English            FALSE  FALSE    TRUE  FALSE FALSE    FALSE     FALSE       TRUE
#10      Indian             Hindi/English            FALSE  FALSE    TRUE  FALSE  TRUE    FALSE     FALSE      FALSE
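If extra packages are not an option, the same per-row flags can be sketched in base R. This is only an illustration under stated assumptions: the data frame `demo` and the names `spoken`, `langs`, `flags` and `result` are hypothetical and not part of the original post. The sketch trims whitespace after splitting, so "Min Nan dialect" does not keep the leading space left by splitting on "/" alone, and it matches whole language names with `%in%` rather than substrings.

```r
# Hypothetical stand-in with the same two columns as lang.df
demo <- data.frame(
  Nationality   = c("HK", "German", "HK", "Brazilian"),
  Language.Home = c("Mandarin", "Mandarin/English/German",
                    "Mandarin/ Min Nan dialect", "Portuguese/English"),
  stringsAsFactors = FALSE
)

# Split each answer on "/" and trim surrounding whitespace
# (as.character() keeps this working if the column is a factor)
spoken <- lapply(strsplit(as.character(demo$Language.Home), "/", fixed = TRUE),
                 trimws)

# All distinct languages across the whole column
langs <- sort(unique(unlist(spoken)))

# One logical column per language: TRUE if that row lists the language
flags <- vapply(
  langs,
  function(l) vapply(spoken, function(s) l %in% s, logical(1)),
  logical(nrow(demo))
)

# cbind() on a data frame keeps the column names as they are (spaces included)
result <- cbind(demo, flags)
result
```

Grouping by a single language is then a matter of subsetting, for example `result[result$English, ]` for the rows where English is spoken at home.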