How can I find the last value for each (loc.x, loc.y) pair before test.day?
library(data.table)
dt <- data.table(
loc.x = as.integer(c(1, 1, 3, 1, 3, 1)),
loc.y = as.integer(c(1, 2, 1, 2, 1, 2)),
time = as.IDate(c("2015-03-11", "2015-05-10", "2015-09-27",
"2015-11-25", "2014-09-13", "2015-08-19")),
value = letters[1:6]
)
setkey(dt, loc.x, loc.y, time)
test.day <- as.IDate("2015-10-01")
Desired output:
   loc.x loc.y value
1:     1     1     a
2:     1     2     f
3:     3     1     c
labs = letters[3:7]
vec = rep(1:5,2)
How can I get a factor whose levels are "c", "d", "e", "f", "g"?
What is the most efficient way (in both time and space) to split the following table
dt = data.table(x=c(1,3,5,4,6,2), y=c(4,7,1,1,2,6))
> dt
   x y
1: 1 4
2: 3 7
3: 5 1
4: 4 1
5: 6 2
6: 2 6
into two separate tables, dt1 and dt2, such that dt1 contains every row (x, y) for which (y, x) is also a row of dt, and dt2 contains the remaining rows:
> dt1
   x y
1: 1 4
2: 4 1
3: 6 2
4: 2 6
> dt2
   x y
1: 3 7
2: 5 1
Efficiency is crucial: the full table has nearly 200M rows.
How can I find the indices of the top k (say k = 3) values of each column?
> dt <- data.table( x = c(1, 1, 3, 1, 3, 1, 1), y = c(1, 2, 1, 2, 2, 1, 1) )
> dt
   x y
1: 1 1
2: 1 2
3: 3 1
4: 1 2
5: 3 2
6: 1 1
7: 1 1
Desired output:
> output.1
   x y
1: 1 2
2: 3 4
3: 5 5
Or, even better (note the additional, useful descending sort in x):
> output.2
   var top1 top2 top3
1:   x    3    5    1
2:   y    2 …
What is the most efficient way to transpose
> dt <- data.table( x = c(1, 1, 3, 1, 3, 1, 1), y = c(1, 2, 1, 2, 2, 1, 1) )
> dt
   x y
1: 1 1
2: 1 2
3: 3 1
4: 1 2
5: 3 2
6: 1 1
7: 1 1
into:
> output
   cn v1 v2 v3 v4 v5 v6 v7
1:  x  1  1  3  1  3  1  1
2:  y  1  2  1  2  2  1  1
dcast.data.table is supposed to be efficient, but I cannot figure out exactly how to use it for this.
my_data <- c(232,294,320,314,336,189,331,185,161,140,49,7,0,3,4,9,38,169,275,316,366,422,328,283,213,238,220,193,250,308,224,190,188,99,41,17,19,9,1,3,10,108,149,189,168,170,155,101,119,89,142,169,192,242,152,141,105,76,39,20,17,13,5,3,8,54,102,102,155,159,164,200,183,144,204,190,219,158,128,142,130,86,58,13,12,0,6,4,20,302,297,312,345,293,233,275,233,199,279,250,208,161,200,181,133,140,17,14,2,0,2,4,36,183,379,371,356,425,320,282,172,214,226,250,196,239,183,194,135,75,28,11,2,3,5,4,29,212,316,343,375,431,225,248,209,258,262,230,218,162,193,178,126,131,37,7,5,3,0,1,20,149,258,408,316,307,352,247,285,236,254,321,233,175,264,114,104,82,37,49,4,16,2,14,22,169,259,355,379,346,261,256,220,238,227,201,242,185,121,160,114,91,33,9,4,2,0,2,22,62,114,156,190,186,140,155,141,135,140,137,179,128,156,124,98,66,63,32,27,0,21,5,4,39,73,162,175,207,183,121,174,107,160,177,258,170,152,165,117,59,35,69,7,0,3,3,28,98,165,194,200,190,162,160,170,200,189,187,141,224,152,115,111,47,20,15,2,0,0,29,10,59,170,212,164,201,193,182,277,283,376,310,194,247,177,164,140,192,95,49,10,10,2,5,38,52,156,331,480,378,231,172,132,199,245,267,192,223,182,168,152,81,20,14,13,6,14,16,6,21,51,113,94,103,113,93,205,98,118,97,138,112,98,99,79,74,71,38,31,30,31,38,41,48,131,159,212,134,150,145,149,105,142,149,122,137,193,105,68,75,35,33,41,38,33,29,44,54,85,109,118,117,113,107,112,92,112,98,111,81,120,113,66,55,10,20,26,25,3,10,15,30,60,91,97,67,100,99,75,92,98,126,116,103,110,87,124,66,55,30,31,28,28,31,29,49,109,144,152,116,106,88,164,127,121,161,186,104,81,79,103,69,47,35,35,30,28,34,42,56,114,110,149,153,112,151,138,151,141,139,206,225,166,173,185,384,221,100,61,51,35,44,38,83,87,182,205,243,191,144,106,112,167,234,147,136,152,107,156,53)
The ACF/PACF correlograms show that my_data has clear seasonality with a period of 24.
library(forecast)
tsdisplay(my_data)
Unfortunately,
auto.arima(my_data, seasonal = TRUE, approximation = FALSE, stepwise = FALSE)
considers only the (p,d,q) terms rather than the expected (p,d,q)(P,D,Q)[24]:
Series: my_data
ARIMA(3,1,2)
Coefficients:
         ar1      ar2      ar3      ma1     ma2
      1.8061  -0.8164  -0.0587  -1.9453  0.9672
s.e.  0.0478   0.0896   0.0474   0.0178  0.0171
sigma^2 estimated as 2261: log likelihood=-2581.68
AIC=5175.36 AICc=5175.54 BIC=5200.52
I have a Pandas DataFrame df with time gaps that may exceed 30 minutes. I want to resample it
r = df.resample('30T')
and then apply some aggregation:
r.apply(my_fancy_aggregation)
my_fancy_aggregation does not work on empty array-likes.
How can I remove the empty bins from r before applying my_fancy_aggregation?
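One possible way around the empty bins, offered only as a sketch: instead of resampling, group by the 30-minute bin that each timestamp falls into, so that bins with no rows are never created. The toy data, the column name `v`, and the body of `my_fancy_aggregation` below are assumptions made for illustration, not the asker's actual data or function. Flooring the index yields the same bin edges as a default 30-minute resample only when the frequency divides a day evenly, as it does here.

```python
import pandas as pd

def my_fancy_aggregation(values):
    # Hypothetical stand-in for the asker's function: it fails on empty input.
    if len(values) == 0:
        raise ValueError("empty bin")
    return values.max() - values.min()

# Assumed toy data with a gap of well over 30 minutes before the last row.
index = pd.to_datetime([
    "2015-01-01 00:05", "2015-01-01 00:20", "2015-01-01 02:10",
])
df = pd.DataFrame({"v": [1.0, 4.0, 7.0]}, index=index)

# Label every row with the start of its 30-minute bin. Grouping on these
# labels only produces groups that contain at least one row.
bins = df.index.floor("30min")
result = df["v"].groupby(bins).apply(my_fancy_aggregation)
print(result)
```

Here the aggregation runs on two bins only (00:00 and 02:00), and the three empty half-hour bins in between are never passed to it.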
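What is an efficient and elegant data.table syntax for finding the most common category for each id? I keep a boolean vector marking the NA positions (for other purposes).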
dt = data.table(id=rep(1:2,7), category=rep(c("x","y",NA), length.out=14))  # explicit recycling to 14 rows
print(dt)
In this toy example, ignoring NA, x is the most common category for id==1 and y for id==2.
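I want the labels assigned by sklearn's LabelEncoder (i.e. 0, 1, 2, 3, ...) to follow a specific order of the categorical variable's possible values (e.g. ['b', 'a', 'c', 'd']). LabelEncoder assigns labels in lexicographic order, as I believe this example shows: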
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['b', 'a', 'c', 'd'])
le.classes_
# array(['a', 'b', 'c', 'd'], dtype='<U1')
le.transform(['a', 'b'])
# array([0, 1])
How can I force the encoder to keep the order in which the values are first encountered in the .fit method (i.e. encode 'b' as 0, 'a' as 1, 'c' as 2, and 'd' as 3)?
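To make the requested behavior concrete, here is a minimal plain-Python sketch of a first-seen-order encoding. It is a stand-in built on a dict, not LabelEncoder itself and not a scikit-learn API; the helper names `fit_first_seen` and `transform` are hypothetical.

```python
def fit_first_seen(values):
    """Map each distinct value to an integer code in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            # The next free code equals the number of values seen so far.
            mapping[v] = len(mapping)
    return mapping

def transform(mapping, values):
    """Encode values with a fitted mapping; unseen values raise KeyError."""
    return [mapping[v] for v in values]

mapping = fit_first_seen(['b', 'a', 'c', 'd'])
codes = transform(mapping, ['a', 'b'])
print(mapping)
print(codes)
```

With this mapping 'b' is encoded as 0 and 'a' as 1, so the two values passed to `transform` come back as 1 and 0, the reverse of the lexicographic result above.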
How can I get a table of the 'x' values that correspond to the top k values of 'y' and 'z'?
> dt <- data.table( x = letters[c(1, 1, 3, 2, 3, 1, 1)],
y = c(1, 2, 1, 2, 2, 1, 1), z = rep(c(1, 2, 3), length.out = 7) )
> dt
   x y z
1: a 1 1
2: a 2 2
3: c 1 3
4: b 2 1
5: c 2 2
6: a 1 3
7: a 1 1
Can this be solved with a join, or do I have to loop over the columns other than 'x'?
> requested.output
   var x Val
1:   y a   2
2:   y b   2
3:   y c   2 …