amu*_*phy 5 variables r data-manipulation
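I'm trying to do some text processing and need to recode the words of a sentence so that a target word is identified in a specific way in a new variable. For example, given a data frame that looks like this...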
subj <- c("1", "1", "1", "2", "2", "2", "2", "2")
condition <- c("A", "A", "A", "B", "B", "B", "B", "B")
sentence <- c("1", "1", "1", "2", "2", "2", "2", "2")
word <- c("I", "like", "dogs.", "We", "don't", "like", "this", "song.")
d <- data.frame(subj,condition, sentence, word)
subj condition sentence word
1 A 1 I
1 A 1 like
1 A 1 dogs.
2 B 2 We
2 B 2 don't
2 B 2 like
2 B 2 this
2 B 2 song.
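I need to create a new column in which every instance of the target word (in this example, where d$word == "like") is marked 0, all words before "like" within the sentence block count down, and all words after "like" count up. Each subject has multiple sentences, and the sentences vary by condition, so the loop needs to account for the instance of the target word in each sentence for each subject. The end result should look like this.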
subj condition sentence word position
1 A 1 I -1
1 A 1 like 0
1 A 1 dogs. 1
2 B 2 We -2
2 B 2 don't -1
2 B 2 like 0
2 B 2 this 1
2 B 2 song. 2
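Sorry if the question is badly worded, I hope it makes sense! Note that the target is not in the same position in every sentence (relative to the start of the sentence). I'm fairly new to R and can figure out how to increment or decrement, but not how to do both within each sentence block. Any suggestions on the best way to do this? Thanks a lot!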
You can add an index column and then use it to compute the relative position.
With data.table it is easy to do this separately for each sentence:
library(data.table)
DT <- data.table(indx=1:nrow(d), d, key="indx")
DT[, position:=(indx - indx[word=="like"]), by=sentence]
# Results
DT
# indx subj condition sentence word position
# 1: 1 1 A 1 I -1
# 2: 2 1 A 1 like 0
# 3: 3 1 A 1 dogs. 1
# 4: 4 2 B 2 We -2
# 5: 5 2 B 2 don't -1
# 6: 6 2 B 2 like 0
# 7: 7 2 B 2 this 1
# 8: 8 2 B 2 song. 2
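The question mentions that each subject has several sentences. If sentence numbers can repeat across subjects, grouping by sentence alone would mix blocks from different subjects. A minimal sketch of grouping by both columns, assuming exactly one "like" per block; the data and the names d2 and DT2 are invented for illustration:

```r
library(data.table)

# Hypothetical data: two subjects who each have a sentence numbered "1".
d2 <- data.frame(
  subj     = c("1", "1", "1", "2", "2", "2"),
  sentence = c("1", "1", "1", "1", "1", "1"),
  word     = c("I", "like", "dogs.", "We", "all", "like"),
  stringsAsFactors = FALSE
)

DT2 <- as.data.table(d2)

# Within each subject/sentence block, number the words and subtract the
# index of the target word. Assumes exactly one "like" per block.
DT2[, position := seq_len(.N) - which(word == "like"), by = .(subj, sentence)]

DT2
```

Using seq_len(.N) inside each group also removes the need for a separate indx column.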
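If the words are not always an exact match (for example, the target has punctuation attached), you may want to use grepl instead of ==: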
DT[, position:=(indx - indx[grepl("like", word)]), by=sentence]
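One thing to watch with grepl("like", word) is that it matches substrings, so words such as "dislike" are flagged too. A small sketch of one alternative, assuming it is acceptable to strip punctuation and ignore case before comparing; the words vector and the names clean and is_target are made up for illustration:

```r
# Hypothetical word list with a near-miss ("dislike") and a target
# carrying trailing punctuation ("like,").
words <- c("I", "dislike", "it", "but", "like,", "you")

# Substring matching flags both "dislike" and "like,".
substring_hits <- grepl("like", words)

# Alternative: remove punctuation, lower-case, then compare exactly.
clean     <- tolower(gsub("[[:punct:]]", "", words))
is_target <- clean == "like"

which(substring_hits)  # positions 2 and 5
which(is_target)       # position 5 only
```

Note that stripping punctuation also changes words like "don't" to "dont", which does not matter for the comparison but would matter if clean were stored.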
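I think that in text processing it is wise to avoid letting text entries become factors. In this case I used as.character, but I would suggest setting options(stringsAsFactors=FALSE) (this is already the default from R 4.0.0 onward):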
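# ave() returns a vector of the same type as its first argument (character
# here), so convert the result back to integer.
d$position <- as.integer(with(d, ave(as.character(word), sentence,
                   FUN = function(x) seq_along(x) - which(x == "like"))))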
> d
subj condition sentence word position
1 1 A 1 I -1
2 1 A 1 like 0
3 1 A 1 dogs. 1
4 2 B 2 We -2
5 2 B 2 don't -1
6 2 B 2 like 0
7 2 B 2 this 1
8 2 B 2 song. 2
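The ave() approach assumes every sentence contains the target word. A hedged sketch of a variant that groups by subject and sentence and returns NA for blocks without the target; the helper relative_position and the data frame d3 are hypothetical names introduced here, not part of the answer above:

```r
# Hypothetical helper: offsets relative to the first occurrence of `target`,
# or NA for every word when the target is absent from the block.
relative_position <- function(x, target = "like") {
  hit <- which(x == target)
  if (length(hit) == 0) {
    return(rep(NA_integer_, length(x)))
  }
  seq_along(x) - hit[1]
}

# Hypothetical data: the second subject's sentence has no "like".
d3 <- data.frame(
  subj     = c("1", "1", "1", "2", "2"),
  sentence = c("1", "1", "1", "1", "1"),
  word     = c("I", "like", "dogs.", "We", "sing."),
  stringsAsFactors = FALSE
)

# Pass row indices through ave() so the result stays integer, and look the
# words up inside FUN for each subject/sentence block.
d3$position <- ave(seq_along(d3$word), d3$subj, d3$sentence,
                   FUN = function(i) relative_position(d3$word[i]))

d3
```

Passing row indices instead of the words themselves keeps position as an integer column without a conversion step.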