wat*_*wer 1 r zoo rcpp data.table
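I have a large table: 10M rows by 33 columns, 28 of which contain some NA values. These NA values need to be patched using locf(). I have read a couple of threads on this topic ("Efficiently locf by groups in a single R data.table" and "na.locf and inverse.rle in Rcpp"). However, those threads are about replacing values in numeric vectors. I am not very familiar with Rcpp, so I don't know how to adapt their code to strings, and my data are all strings.
Here is my sample data:
Input data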
Sample_File = structure(list(SO = c(112, 112, 112, 112, 113, 113, 113, 113),
Product.ID = c("AB123", "CD234", "DE345", "EF456", "FG456",
"GH567", "HI678", "IJ789"), Name = c(NA, NA, NA, "Human Being",
NA, "Lion", NA, "Bird"), Family = c(NA, NA, NA, "Homo Sapiens",
NA, NA, NA, "Passeridae"), SL1_Continent = c("Asia", NA,
"Asia", "Asia", NA, NA, NA, "Australia"), SL2_Country = c("China",
"China", NA, NA, NA, NA, NA, "Australia"), SL3_Direction = c("East",
NA, "East", "East", NA, NA, NA, "West"), Expiration_FY = c(2021,
NA, 2018, NA, 2012, 2012, NA, 2012), Flag = c("Y", NA, "N",
"N", NA, NA, NA, "TBD"), Insured = c("No", NA, NA, NA, NA,
NA, NA, "Yes"), Revenue = c(0, 478227.44, 0, 0, 0, 0, 125550.4,
44314.51), Quantity = c(1000, 100, 100, 4, 6, 6, 4, 6)), .Names = c("SO",
"Product.ID", "Name", "Family", "SL1_Continent", "SL2_Country",
"SL3_Direction", "Expiration_FY", "Flag", "Insured", "Revenue",
"Quantity"), row.names = c(NA, 8L), class = "data.frame")
Here is my code using data.table:
data.table::setDT(Sample_File)
cols <- c("Name","Family","SL1_Continent","SL2_Country","SL3_Direction","Expiration_FY","Flag","Insured")
Sample_File[, (cols):=lapply(.SD, function(x){na.locf(x,fromLast = TRUE,na.rm=TRUE)}), by = SO, .SDcols = cols]
Expected output:
Output = structure(list(SO = c(112, 112, 112, 112, 113, 113, 113, 113),
Product.ID = c("AB123", "CD234", "DE345", "EF456", "FG456",
"GH567", "HI678", "IJ789"), Name = c("Human Being", "Human Being",
"Human Being", "Human Being", "Lion", "Lion", "Bird", "Bird"
), Family = c("Homo Sapiens", "Homo Sapiens", "Homo Sapiens",
"Homo Sapiens", "Passeridae", "Passeridae", "Passeridae",
"Passeridae"), SL1_Continent = c("Asia", "Asia", "Asia",
"Asia", "Australia", "Australia", "Australia", "Australia"
), SL2_Country = c("China", "China", "China", "China", "Australia",
"Australia", "Australia", "Australia"), SL3_Direction = c("East",
"East", "East", "East", "West", "West", "West", "West"),
Expiration_FY = c(2021, 2018, 2018, 2021, 2012, 2012, 2012,
2012), Flag = c("Y", "N", "N", "N", "TBD", "TBD", "TBD",
"TBD"), Insured = c("No", "No", "No", "No", "Yes", "Yes",
"Yes", "Yes"), Revenue = c(0, 478227.44, 0, 0, 0, 0, 125550.4,
44314.51), Quantity = c(1000, 100, 100, 4, 6, 6, 4, 6)), .Names = c("SO",
"Product.ID", "Name", "Family", "SL1_Continent", "SL2_Country",
"SL3_Direction", "Expiration_FY", "Flag", "Insured", "Revenue",
"Quantity"), row.names = c(NA, -8L), class = "data.frame")
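While the code above takes only a fraction of a second to execute, processing a single column of my original dataset takes about 10 minutes, let alone all 28 columns, even when using data.table.
I assume I am not really leveraging the power of data.table here, but I am not sure. I would sincerely appreciate any help speeding up the na.locf() call.
Is there a more efficient way to replace the NAs above?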
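I have simplified the problem for the purpose of this example, but I guess it is easy to generalize. The code below defines the function locppf in Rcpp, using C++11 syntax: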
#include <Rcpp.h>
#include <unordered_map>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
// maps a group key to the index of the row to fill from
using Map = std::unordered_map<double, int> ;
using Pair = Map::value_type ;
// [[Rcpp::export]]
CharacterVector locppf(NumericVector g, CharacterVector s) {
  auto n = g.size() ;
  CharacterVector out = clone(s) ;
  Map map ;
  // scan from the end, so the map always points at a later row of the same group
  for(int i=n-1; i>=0; i--){
    double value = g[i] ;
    auto it = map.find( value ) ;
    if( it == map.end() ){
      // first time we see this group: remember the current index
      map.insert( Pair(value, i) ) ;
    } else {
      // if the current value is NA, replace it with the data at correct idx
      auto current = s[i] ;
      if( CharacterVector::is_na( current ) ){
        out[i] = s[ it->second ] ;
      } else {
        it->second = i ;
      }
    }
  }
  return out ;
}
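The idea is to define a map that keeps track of the index at which we last saw something that is not NA in each group. I used std::unordered_map<double, int> as the map because the grouping variable in your example is a numeric vector too.
Let's break down the relevant nuggets: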
if( it == map.end() ){
map.insert( Pair(value, i) ) ;
}
Here we check whether the map has already seen the current group value; if it has not, we retain the current index.
auto current = s[i] ;
if( CharacterVector::is_na( current ) ){
out[i] = s[ it->second ] ;
} else {
it->second = i ;
}
Here we check whether the current value is NA, using CharacterVector::is_na.
If it is, we fill the result vector with the value at the index we retained earlier.
If it is not, we update the index the map remembers for this group.
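To make the map logic easier to try outside of R, here is a minimal sketch of the same grouped backward fill in plain standard C++. It is an illustration, not the Rcpp code above: `Column`, `fill_backward_by_group` and the use of `std::nullopt` as a stand-in for NA are all assumptions made for this example, and this variant records an index only when the value is not missing, which should give the same result.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a character column: std::nullopt plays the role of NA.
using Column = std::vector<std::optional<std::string>>;

// Fill each missing entry with the next non-missing entry of the same group,
// i.e. a per-group backward fill that leaves trailing gaps untouched.
Column fill_backward_by_group(const std::vector<double>& group, const Column& values) {
    Column out = values;
    // For each group key: index of the nearest non-missing value seen so far
    // while scanning from the end of the column.
    std::unordered_map<double, std::size_t> next_valid;
    for (std::size_t i = values.size(); i-- > 0;) {
        if (values[i].has_value()) {
            next_valid[group[i]] = i;      // remember this position for earlier rows
        } else {
            auto it = next_valid.find(group[i]);
            if (it != next_valid.end()) {
                out[i] = values[it->second];  // fill the gap from the later row
            }
        }
    }
    return out;
}
```

Entries with no later non-missing value in their group stay missing, which mirrors na.rm = FALSE in the na.locf calls used for the timings.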
Now let's give ourselves some data:
library("zoo")
library("dplyr")
library("data.table")
with_holes <- function(x, p = .2){
n <- length(x)
x[ sample(n, n*p) ] <- NA
x
}
n <- 1e6
x <- sample( as.numeric(1:100), n, replace= TRUE )
y <- with_holes( sample( letters, n, replace = TRUE) )
d <- data_frame( x = x, y = y )
And measure the timings of the various options:
Using dplyr syntax with group_by, mutate and na.locf:
> system.time( d %>% group_by(x) %>% mutate( y = na.locf(y, fromLast = TRUE, na.rm = FALSE) ) )
user system elapsed
0.173 0.023 0.198
Using data.table syntax with na.locf. I don't guarantee this is the best data.table way to do it.
> d2 <- as.data.table(d)
> system.time( d2[ , y := na.locf(y, fromLast = TRUE, na.rm = FALSE) , x ] )
user system elapsed
0.159 0.030 0.188
Now with the custom locppf function:
> system.time( locppf(d$x, d$y) )
user system elapsed
0.028 0.001 0.028
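Since the question involves 28 columns that share one grouping variable, one possible way to generalize is to resolve the groups once and reuse that work for every column. The following is only a sketch in plain standard C++ under stated assumptions: `Column`, `kNone`, `next_row_in_group` and `fill_columns_backward` are hypothetical names, `std::nullopt` stands in for NA, and every column is assumed to have the same length as the group vector. It has not been benchmarked against the timings above.

```cpp
#include <cstddef>
#include <limits>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a character column: std::nullopt plays the role of NA.
using Column = std::vector<std::optional<std::string>>;

// Sentinel meaning "no later row in this group".
constexpr std::size_t kNone = std::numeric_limits<std::size_t>::max();

// For every row, the index of the next row that belongs to the same group, or kNone.
std::vector<std::size_t> next_row_in_group(const std::vector<double>& group) {
    std::vector<std::size_t> next(group.size(), kNone);
    std::unordered_map<double, std::size_t> last_seen;
    for (std::size_t i = group.size(); i-- > 0;) {
        auto it = last_seen.find(group[i]);
        if (it != last_seen.end()) next[i] = it->second;
        last_seen[group[i]] = i;
    }
    return next;
}

// Backward-fill every column in place; the hashing of group keys happens only once.
// Assumes each column has exactly group.size() entries.
void fill_columns_backward(const std::vector<double>& group, std::vector<Column>& columns) {
    const std::vector<std::size_t> next = next_row_in_group(group);
    for (Column& col : columns) {
        for (std::size_t i = col.size(); i-- > 0;) {
            if (!col[i].has_value() && next[i] != kNone) {
                // The later row is already final because we scan from the end.
                col[i] = col[next[i]];
            }
        }
    }
}
```

The design choice here is that the per-column loop touches only a plain index vector, so the cost of the hash map is paid once rather than once per column.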