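I am currently working with a set of large datasets, and I am trying to improve the way I write scripts in R. I tend to rely mainly on for loops, which I know are cumbersome and slow, especially on very large datasets.
I have heard many people recommend the apply() family to avoid complicated for loops, but I have trouble using them to apply several functions in one go.
Here is some simple example data: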
A <- data.frame('Area' = c(4, 6, 5),
                'flow' = c(1, 1, 1))
B <- data.frame('Area' = c(6, 8, 4),
                'flow' = c(1, 2, 1))
files <- list(A, B)
frames <- list('A', 'B')
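What I want to do is sort the data by the 'flow' variable, then add columns for the share of the total 'flow' and 'Area' that each data point represents, and finally add two more columns with the cumulative percentage of each variable.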
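To make the calculation concrete, here is a minimal sketch of the arithmetic for B alone, in base R. The object names `b`, `area_portion`, `flow_portion`, `cum_area` and `cum_flow` are only illustrative.

```r
# Worked example for a single data frame (values taken from B above).
b <- data.frame(Area = c(6, 8, 4), flow = c(1, 2, 1))

# Sort by flow, largest first.
b <- b[order(-b$flow), ]

# Each row's share of the column total, in percent.
area_portion <- b$Area / sum(b$Area) * 100
flow_portion <- b$flow / sum(b$flow) * 100

# Running totals of those shares; they end at 100
# (up to floating-point rounding).
cum_area <- cumsum(area_portion)
cum_flow <- cumsum(flow_portion)

round(cum_area, 2)  # 44.44 77.78 100.00
cum_flow            # 50 75 100
```

Every step here is already vectorized, so the only thing left to loop over is the list of data frames.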
Currently I use this loop:
sort_files <- list()
n <- 1
for(i in files){
  name <- frames[n]
  nom <- paste(name,'_sorted', sep = '')
  data <- i[order(-i$flow),]
  area <- sum(i$Area)
  total <- sum(i$flow)
  data$area_portion <- (data$Area/area)*100
  data$flow_portion <- (data$flow/total)*100
  data$cum_area <- cumsum(data$area_portion)
  data$cum_flow <- cumsum(data$flow_portion)
  assign(nom, data)
  df <- get(paste(name,'_sorted', sep = ''))
  sort_files[[nom]] <- df
  n <- n + 1
}
This works, but it looks overly complicated and ugly, and I am sure it runs much slower than a better script would.
How can I simplify and tidy up the above code?
Here is the expected output:
sort_files
$`A_sorted`
Area flow area_portion flow_portion cum_area cum_flow
1 4 1 26.66667 33.33333 26.66667 33.33333
2 6 1 40.00000 33.33333 66.66667 66.66667
3 5 1 33.33333 33.33333 100.00000 100.00000
$B_sorted
Area flow area_portion flow_portion cum_area cum_flow
2 8 2 44.44444 50 44.44444 50
1 6 1 33.33333 25 77.77778 75
3 4 1 22.22222 25 100.00000 100
Ron*_*hah 15
Use lapply to loop over files, and dplyr's mutate to add the new columns:
library(dplyr)
setNames(lapply(files, function(x)
  x %>%
    arrange(desc(flow)) %>%
    mutate(area_portion = Area/sum(Area)*100,
           flow_portion = flow/sum(flow) * 100,
           cum_area = cumsum(area_portion),
           cum_flow = cumsum(flow_portion))
), paste0(frames, "_sorted"))
#$A_sorted
# Area flow area_portion flow_portion cum_area cum_flow
#1 4 1 26.66667 33.33333 26.66667 33.33333
#2 6 1 40.00000 33.33333 66.66667 66.66667
#3 5 1 33.33333 33.33333 100.00000 100.00000
#$B_sorted
# Area flow area_portion flow_portion cum_area cum_flow
#1 8 2 44.44444 50 44.44444 50
#2 6 1 33.33333 25 77.77778 75
#3 4 1 22.22222 25 100.00000 100
Or, to go fully the tidyverse way, we can replace lapply with map and setNames with set_names:
library(tidyverse)
map(set_names(files, str_c(frames, "_sorted")),
    . %>% arrange(desc(flow)) %>%
      mutate(area_portion = Area/sum(Area)*100,
             flow_portion = flow/sum(flow) * 100,
             cum_area = cumsum(area_portion),
             cum_flow = cumsum(flow_portion)))
Updated the tidyverse approach following some pointers from @Moody_Mudskipper.
You could also define a function first..
f <- function(data) {
  # sort data by flow, largest first
  data <- data[order(data[["flow"]], decreasing = TRUE), ]
  # add the percentage and cumulative-percentage columns
  # ([[ ]] extracts each column as a plain vector rather than a one-column data frame)
  data[["area_portion"]] <- data[["Area"]] / sum(data[["Area"]]) * 100
  data[["flow_portion"]] <- data[["flow"]] / sum(data[["flow"]]) * 100
  data[["cum_area"]] <- cumsum(data[["area_portion"]])
  data[["cum_flow"]] <- cumsum(data[["flow_portion"]])
  data
}
..and use lapply to, ahem, apply f to your list
out <- lapply(files, f)
out
#[[1]]
# Area flow area_portion flow_portion cum_area cum_flow
#1 4 1 26.66667 33.33333 26.66667 33.33333
#2 6 1 40.00000 33.33333 66.66667 66.66667
#3 5 1 33.33333 33.33333 100.00000 100.00000
#[[2]]
# Area flow area_portion flow_portion cum_area cum_flow
#2 8 2 44.44444 50 44.44444 50
#1 6 1 33.33333 25 77.77778 75
#3 4 1 22.22222 25 100.00000 100
If you want to change the names of out, you can use setNames
out <- setNames(lapply(files, f), paste0(c("A", "B"), "_sorted"))
# or
# out <- setNames(lapply(files, f), paste0(unlist(frames), "_sorted"))
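As a variation on the above, one option is to name the list when it is created: lapply keeps the names of its input, so the separate frames list and the setNames step are no longer needed. This is only a sketch; it redefines the example data and a function f with the same steps as above so that it runs on its own.

```r
# Sketch: name the list once, and lapply() carries the names through.
A <- data.frame(Area = c(4, 6, 5), flow = c(1, 1, 1))
B <- data.frame(Area = c(6, 8, 4), flow = c(1, 2, 1))
files <- list(A_sorted = A, B_sorted = B)

# Same steps as f above; [[ ]] extracts each column as a plain vector.
f <- function(data) {
  data <- data[order(data[["flow"]], decreasing = TRUE), ]
  data[["area_portion"]] <- data[["Area"]] / sum(data[["Area"]]) * 100
  data[["flow_portion"]] <- data[["flow"]] / sum(data[["flow"]]) * 100
  data[["cum_area"]] <- cumsum(data[["area_portion"]])
  data[["cum_flow"]] <- cumsum(data[["flow_portion"]])
  data
}

out <- lapply(files, f)
names(out)  # "A_sorted" "B_sorted"
```

Individual results can then be pulled out by name, for example out$B_sorted or out[["A_sorted"]].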