Der*_*wis 72 r data.table
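I have a script that reads data from a CSV file into a data.table and then splits the text in one column into several new columns. I am currently using the lapply and strsplit functions to do this. Here is an example: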
library("data.table")
df = data.table(PREFIX = c("A_B","A_C","A_D","B_A","B_C","B_D"),
VALUE = 1:6)
dt = as.data.table(df)
# split PREFIX into new columns
dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1))
dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2))
dt
# PREFIX VALUE PX PY
# 1: A_B 1 A B
# 2: A_C 2 A C
# 3: A_D 3 A D
# 4: B_A 4 B A
# 5: B_C 5 B C
# 6: B_D 6 B D
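In the example above, the column PREFIX is split into two new columns, PX and PY, on the "_" character.
This works fine, but I was wondering whether there is a better (more efficient) way to do it using data.table. My real datasets have 10M+ rows, so time and memory efficiency become really important.
Following @Frank's suggestion, I created a larger test case and ran the suggested command, but stringr::str_split_fixed takes much longer than the original method.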
library("data.table")
library("stringr")
system.time ({
df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000),
VALUE = rep(1:6, 1000000))
dt = data.table(df)
})
# user system elapsed
# 0.682 0.075 0.758
system.time({ dt[, c("PX","PY") := data.table(str_split_fixed(PREFIX,"_",2))] })
# user system elapsed
# 738.283 3.103 741.674
rm(dt)
system.time ( {
df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000),
VALUE = rep(1:6, 1000000) )
dt = as.data.table(df)
})
# user system elapsed
# 0.123 0.000 0.123
# split PREFIX into new columns
system.time ({
dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1))
dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2))
})
# user system elapsed
# 33.185 0.000 33.191
So the str_split_fixed method takes about 20 times longer.
Aru*_*run 106
Update: From version 1.9.6 (on CRAN as of September 2015), we can use the function tstrsplit() to get the result directly (and much more efficiently):
require(data.table) ## v1.9.6+
dt[, c("PX", "PY") := tstrsplit(PREFIX, "_", fixed=TRUE)]
# PREFIX VALUE PX PY
# 1: A_B 1 A B
# 2: A_C 2 A C
# 3: A_D 3 A D
# 4: B_A 4 B A
# 5: B_C 5 B C
# 6: B_D 6 B D
tstrsplit() is basically a wrapper for transpose(strsplit()), where the transpose() function, also implemented recently, transposes a list. See ?tstrsplit() and ?transpose() for examples.
See the edit history for the old answers.
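To make the wrapper relationship concrete, here is a minimal sketch that performs the two steps by hand and then in a single call. The vector x and the variable names are made up for illustration; only strsplit(), transpose() and tstrsplit() come from the answer above.

```r
library(data.table)

x <- c("A_B", "A_C", "B_D")

# Step 1: strsplit() returns one character vector per input string.
parts <- strsplit(x, "_", fixed = TRUE)

# Step 2: transpose() turns that list into one vector per output column.
manual <- transpose(parts)

# tstrsplit() performs both steps in one call.
direct <- tstrsplit(x, "_", fixed = TRUE)

# Inputs with fewer pieces are padded with NA by default (fill = NA).
ragged <- tstrsplit(c("A_B", "C"), "_", fixed = TRUE)
```

Because the result is a list with one element per new column, it can be assigned directly with := as in the answer above.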
Ha *_*ham 14
I am adding an answer for those who are not using data.table v1.9.5 and want a one-line solution.
dt[, c('PX','PY') := do.call(Map, c(f = c, strsplit(PREFIX, '_', fixed = TRUE))) ]
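The do.call(Map, ...) idiom can be hard to read at first, so here is a small base-R sketch of what it does on its own. The vector x and the variable names are hypothetical; the idea is that Map(c, piece1, piece2, ...) combines the first elements of all pieces, then the second elements, which amounts to transposing the list returned by strsplit().

```r
x <- c("A_B", "A_C", "B_D")

# One character vector of length 2 per input string.
parts <- strsplit(x, "_", fixed = TRUE)

# Pass every element of `parts` as a separate argument to Map(f = c, ...).
cols <- do.call(Map, c(f = c, parts))

# Map() names its result after the first argument; drop the names,
# since := takes the column names from its left-hand side anyway.
cols <- unname(cols)
```

Note that this passes one argument per row to Map(), so it is likely to scale poorly on very large tables.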
Using the splitstackshape package:
library(splitstackshape)
cSplit(df, splitCols = "PREFIX", sep = "_", direction = "wide", drop = FALSE)
# PREFIX VALUE PREFIX_1 PREFIX_2
# 1: A_B 1 A B
# 2: A_C 2 A C
# 3: A_D 3 A D
# 4: B_A 4 B A
# 5: B_C 5 B C
# 6: B_D 6 B D
小智 6
We could try:
library(data.table)
cbind(dt, fread(text = dt$PREFIX, sep = "_", header = FALSE))
# PREFIX VALUE V1 V2
# 1: A_B 1 A B
# 2: A_C 2 A C
# 3: A_D 3 A D
# 4: B_A 4 B A
# 5: B_C 5 B C
# 6: B_D 6 B D
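As a possible variation, the new columns can be named and typed in the same call. This sketch assumes a data.table version whose fread() accepts the text argument, and it uses the col.names and colClasses arguments to get PX and PY as character columns instead of the default V1 and V2 with automatic type detection. The small table dt is rebuilt here only so the snippet runs on its own.

```r
library(data.table)

dt <- data.table(PREFIX = c("A_B", "A_C", "A_D", "B_A", "B_C", "B_D"),
                 VALUE = 1:6)

# Parse the PREFIX strings as if they were lines of a "_"-separated file.
parts <- fread(text = dt$PREFIX, sep = "_", header = FALSE,
               col.names = c("PX", "PY"),   # name the pieces directly
               colClasses = "character")    # skip type guessing

res <- cbind(dt, parts)
```

Forcing character columns avoids surprises when a piece happens to look like a number or a logical value.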