And*_*eas 2 r dataframe data.table
I have the following data.frame:
b<-structure(list(b = c("47.83006,11.71699 47.83004,11.71691 47.83002,11.7168 47.83001,11.71662",
"47.83001,11.71662 47.82993,11.71628 47.82991,11.7162 47.82988,11.71614 47.82983,11.71609 47.8295,11.71588 47.82919,11.71566 47.82898,11.71549 47.82845,11.71504 47.82832,11.715 47.82821,11.715 47.82712,11.71531 47.82639,11.71549 47.82606,11.71561 47.8257,11.71567 47.82548,11.71574 47.82433,11.71613",
"47.82433,11.71613 47.82436,11.7165 47.8244,11.71715 47.82442,11.71742 47.82453,11.71823 47.82459,11.71856 47.82492,11.7199",
"47.82492,11.7199 47.82495,11.72005 47.82503,11.72034 47.82515,11.72066 47.82526,11.72093 47.82556,11.72172 47.82559,11.72182 47.82561,11.72191 47.82562,11.72201",
"47.85051,12.11965 47.85092,12.11997", "48.10034,11.75948 48.10021,11.75938"
)), row.names = c(NA, 6L), class = "data.frame")
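Each row holds lat,lon coordinate pairs separated by spaces.
How can I build a data.frame or data.table from this structure as efficiently as possible, with each lat,lon pair on its own row and the lat and lon values in separate columns?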
Lat lon
47.83006 11.71699
47.83004 11.71691
47.83002 11.7168
…
Update: Thanks for your solutions. I will go with @Gki's proposal because it is faster:
Unit: milliseconds
Expressions:
 c: c <- b %>% separate_rows(b, sep = " ") %>% separate(b, into = c("Lat", "Lon"), sep = ",", convert = T) %>% data.frame()
 d: d <- read.csv(text = unlist(strsplit(b$b, " ", TRUE)), col.names = c("Lat", "Lon"))
 expr       min        lq      mean    median        uq       max neval
    c 12.363628 13.031700 14.027860 13.408883 13.703157 28.922909   100
    d  1.020622  1.050315  1.119533  1.117269  1.170826  1.348833   100
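You can use strsplit to split on the spaces between the pairs and then read.csv to get a data.frame. Pass header = FALSE, otherwise the first pair is consumed as a header line and dropped from the result.
read.csv(text=unlist(strsplit(b$b, " ", TRUE)), header = FALSE, col.names = c("Lat", "Lon"))
#       Lat      Lon
#1 47.83006 11.71699
#2 47.83004 11.71691
#3 47.83002 11.71680
#4 47.83001 11.71662
#5 47.83001 11.71662
#6 47.82993 11.71628
#7 47.82991 11.71620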
#...
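One detail worth checking with read.csv(text = ...) is that it defaults to header = TRUE, so the first element of the text vector is treated as a header line even when col.names is supplied. The following is a minimal sketch, using a two-row toy input in the same shape as the question's b (values reused from it), that compares the row counts with and without header = FALSE:

```r
# Toy input in the same shape as the question's `b` (assumption: two rows only)
b <- data.frame(
  b = c("47.83006,11.71699 47.83004,11.71691",
        "47.85051,12.11965 47.85092,12.11997"),
  stringsAsFactors = FALSE
)

# One "lat,lon" string per element: four pairs in total
pairs <- unlist(strsplit(b$b, " ", fixed = TRUE))

# Default header = TRUE: the first pair is read as a header line and then
# overwritten by col.names, so that coordinate never becomes a data row
with_header <- read.csv(text = pairs, col.names = c("Lat", "Lon"))

# header = FALSE keeps every pair as a data row
no_header <- read.csv(text = pairs, header = FALSE, col.names = c("Lat", "Lon"))

nrow(with_header)
nrow(no_header)
```

Comparing nrow() of the result against length(pairs) is a cheap way to confirm that no coordinate was lost.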
Or, since R 4.1.0, using the forward pipe operator |> and the function shorthand \() in base:
strsplit(b$b, " ", TRUE) |> unlist() |> (\(d) read.csv(text=d, header = FALSE, col.names = c("Lat", "Lon")))()
#       Lat      Lon
#1 47.83006 11.71699
#2 47.83004 11.71691
#3 47.83002 11.71680
#...
Or use the bizarre pipe ->.; instead of defining a function:
strsplit(b$b, " ", TRUE) |> unlist() ->.; read.csv(text=., header = FALSE, col.names = c("Lat", "Lon"))
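The bizarre pipe is not a separate operator. It is a right-assignment to a variable named `.`, followed by a statement separator. A small sketch with made-up numbers (not taken from the question) shows the equivalence; it assumes R 4.1.0 or later for |>:

```r
# `-> .` assigns the left-hand result to a variable called `.`,
# and `;` starts a new statement that can refer to it
c(1, 4, 9) |> sqrt() -> .; total_piped <- sum(.)

# The same thing written as two ordinary statements
. <- sqrt(c(1, 4, 9))
total_plain <- sum(.)

identical(total_piped, total_plain)
```

Unlike a real pipe, this leaves the variable `.` behind in the calling environment.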
If you skip setting the column names and converting to numeric, a fast way that produces a character matrix is:
do.call(rbind, strsplit(unlist(strsplit(b$b, " ", TRUE)), ",", TRUE))
Or, converting it to numeric:
matrix(as.numeric(unlist(strsplit(unlist(strsplit(b$b, " ", TRUE)), ",", TRUE))), ncol=2, byrow=TRUE)
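If the Lat and Lon names from the question are still wanted, they can be attached to the numeric matrix through dimnames and the result converted with as.data.frame. This is a sketch on a two-row toy input (values reused from the question), not a benchmarked variant:

```r
# Toy input in the same shape as the question's `b` (assumption: two rows only)
b <- data.frame(
  b = c("47.83006,11.71699 47.83004,11.71691",
        "47.85051,12.11965 47.85092,12.11997"),
  stringsAsFactors = FALSE
)

# Split into "lat,lon" strings, then into the individual numbers
pairs <- unlist(strsplit(b$b, " ", fixed = TRUE))
values <- as.numeric(unlist(strsplit(pairs, ",", fixed = TRUE)))

# byrow = TRUE puts each lat,lon pair on its own row; dimnames names the columns
m <- matrix(values, ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("Lat", "Lon")))

# Optional: convert to a data.frame once the numeric matrix is built
coords <- as.data.frame(m)
str(coords)
```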
Comparison with the data.table solution from @mt1022:
library(data.table)
microbenchmark::microbenchmark(
base = do.call(rbind, strsplit(unlist(strsplit(b$b, " ", TRUE)), ",", TRUE))
, baseNum = matrix(as.numeric(unlist(strsplit(unlist(strsplit(b$b, " ", TRUE)), ",", TRUE))), ncol=2, byrow=TRUE)
, data.table = as.data.table(tstrsplit(unlist(strsplit(b$b, ' ', T)), ',', T))
)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# base 28.829 30.2965 33.08313 31.5705 33.0475 85.880 100 a
# baseNum 29.832 31.3030 33.51445 32.3635 34.5395 56.851 100 a
# data.table 143.745 147.9900 155.41194 150.9960 157.2420 278.190 100 b
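These timings come from a six-row input, so they may not carry over to larger data. A rough sketch for rerunning the comparison on a bigger input, using only base R and system.time, could look like the following. The helper names split_matrix and split_csv are hypothetical, and the input is just a few of the question's pairs replicated:

```r
# Replicate a small input to get a larger test case (assumption: 10,000 pairs
# is enough to see a difference; adjust the repeat count as needed)
small <- c("47.83006,11.71699 47.83004,11.71691 47.83002,11.7168",
           "47.85051,12.11965 47.85092,12.11997")
big <- rep(small, 2000)

# Hypothetical helper: strsplit twice, then build a numeric matrix
split_matrix <- function(x) {
  pairs <- unlist(strsplit(x, " ", fixed = TRUE))
  matrix(as.numeric(unlist(strsplit(pairs, ",", fixed = TRUE))),
         ncol = 2, byrow = TRUE)
}

# Hypothetical helper: strsplit once, then let read.csv parse the pairs
split_csv <- function(x) {
  read.csv(text = unlist(strsplit(x, " ", fixed = TRUE)),
           header = FALSE, col.names = c("Lat", "Lon"))
}

system.time(m <- split_matrix(big))
system.time(d <- split_csv(big))

# Sanity check that both approaches yield the same numbers
same_values <- all.equal(unname(m), unname(as.matrix(d)))
```

system.time is much coarser than microbenchmark, so this only indicates the order of magnitude on larger inputs.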