Parse out a string and set it as a factor column in an R data.table

Lor*_*rai 4 grep r data.table

I can't find an elegant way to do this, please help.

I have a data.table DT:

name,value
"lorem pear ipsum",4
"apple ipsum lorem",2
"lorem ipsum plum",6

Based on the list Fruits <- c("pear", "apple", "plum"), I want to create a factor-type column:

name,value,factor
"lorem pear ipsum",4,"pear"
"apple ipsum lorem",2,"apple"
"lorem ipsum plum",6,"plum"

I suppose this is basic, but I'm a bit stuck. This is how far I got:

DT[grep("apple", name, ignore.case=TRUE), factor := as.factor("apple")]

Thanks in advance.

And*_*rie 6

You can vectorise this with a regular expression, for example using gsub():

Set up the data:

strings <- c("lorem pear ipsum", "apple ipsum lorem", "lorem ipsum plum")
fruit <- c("pear", "apple", "plum")

Now create a regular expression:

ptn <- paste0(".*(", paste(fruit, collapse="|"), ").*")
gsub(ptn, "\\1", strings)
[1] "pear"  "apple" "plum" 

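The regular expression works by joining the search terms with |, wrapped in parentheses so that the matched term is captured as a group. The surrounding .* consume the rest of the string, so replacing with \\1 leaves only the captured term: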

ptn
[1] ".*(pear|apple|plum).*"
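One caveat worth checking: when none of the alternatives match, gsub() performs no substitution and returns the string unchanged, so a row without a fruit would end up with its whole name as the result. The sketch below shows that behaviour and one base-R way to get NA instead, using regexpr() and regmatches(). The fourth string and the helper name extract_first are made up for illustration and are not part of the original answer.

```r
fruit <- c("pear", "apple", "plum")
strings <- c("lorem pear ipsum", "apple ipsum lorem",
             "lorem ipsum plum", "lorem ipsum dolor")  # last string has no fruit (hypothetical)

ptn <- paste0(".*(", paste(fruit, collapse = "|"), ").*")
# No match means no substitution: the fourth string comes back as-is
gsub(ptn, "\\1", strings)

# Hypothetical helper: return the leftmost matching term, or NA if none matches
extract_first <- function(x, choices) {
  m <- regexpr(paste(choices, collapse = "|"), x)
  hit <- as.integer(m) != -1L        # regexpr() returns -1 where nothing matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)       # regmatches() returns only the matched pieces
  out
}
extract_first(strings, fruit)
```

Whether NA or the untouched string is the better result for a non-matching row depends on the data, so treat this as one option rather than a required change.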

Doing this in a data.table, as in your question, is then just as simple:

library(data.table)
DT <- data.table(name=strings, value=c(4, 2, 6))
# Use the name column itself rather than the global strings vector
DT[, factor := gsub(ptn, "\\1", name)]
DT

                name value factor
1:  lorem pear ipsum     4   pear
2: apple ipsum lorem     2  apple
3:  lorem ipsum plum     6   plum
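Note that gsub() returns a character vector, so the column created above is character rather than factor. If a true factor is wanted, as the question asks, one option is to wrap the result in factor(). This is a sketch, assuming the levels should follow the order of fruit; the column name fruit_f is chosen here purely for illustration.

```r
library(data.table)

fruit <- c("pear", "apple", "plum")
ptn <- paste0(".*(", paste(fruit, collapse = "|"), ").*")
DT <- data.table(name = c("lorem pear ipsum", "apple ipsum lorem", "lorem ipsum plum"),
                 value = c(4, 2, 6))

# Extract the fruit from the name column and store it as a factor
# (fruit_f is an illustrative column name, not from the original answer)
DT[, fruit_f := factor(gsub(ptn, "\\1", name), levels = fruit)]

is.factor(DT$fruit_f)  # TRUE
levels(DT$fruit_f)     # "pear" "apple" "plum"
```

With explicit levels, a name that contains none of the fruits becomes NA instead of introducing a spurious level.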


A5C*_*2T1 5

I don't know whether there is a more "data.table" way of doing it, but you can try this:

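# For each row, take the first fruit (in Fruits order) found in that row's name
DT[, factor := sapply(name, function(x)
  Fruits[sapply(Fruits, grepl, x, ignore.case=TRUE)][1], USE.NAMES=FALSE)]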
DT
#                 name value factor
# 1:  lorem pear ipsum     4   pear
# 2: apple ipsum lorem     2  apple
# 3:  lorem ipsum plum     6   plum
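Another option, closer to the attempt in the question, is to loop over the fruits and assign by reference to the matching rows. This is a sketch only: the shuffled row order, the extra "Pear" row and the temporary column fruit_chr are made up here to show that the result does not depend on row order or on each fruit appearing exactly once.

```r
library(data.table)

Fruits <- c("pear", "apple", "plum")
DT <- data.table(name = c("lorem ipsum plum", "lorem pear ipsum",
                          "apple ipsum lorem", "Pear lorem ipsum"),
                 value = c(6, 4, 2, 1))

# Tag the rows matching each fruit in turn; rows that never match stay NA
for (f in Fruits) {
  DT[grepl(f, name, ignore.case = TRUE), fruit_chr := f]
}

# Convert to a factor once all rows are tagged, then drop the helper column
DT[, factor := factor(fruit_chr, levels = Fruits)][, fruit_chr := NULL]
DT
```

If a name contains more than one fruit, the later entry in Fruits overwrites the earlier one in this version, so the order of Fruits acts as a priority list.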