Kay*_*fey 4 group-by time-series lag julia
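I would like to know whether there is a simple way to create lags (or leads) of a time-series variable in Julia by group or by condition. For example, I have a dataset of the following form: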
julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
var2=[0,1,2,3,0,1,2,3])
8×2 DataFrame
│ Row │ var1   │ var2  │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ a      │ 0     │
│ 2   │ a      │ 1     │
│ 3   │ a      │ 2     │
│ 4   │ a      │ 3     │
│ 5   │ b      │ 0     │
│ 6   │ b      │ 1     │
│ 7   │ b      │ 2     │
│ 8   │ b      │ 3     │
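I want to create a variable lag2 that contains the values of var2 lagged by 2. However, the lag should be computed within each var1 group, so that the first two observations in group 'b' do not receive the last two values of group 'a'. Instead, they should be set to missing, zero, or some other default value.
I tried the following code, but it produces the error below.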
julia> df2 = df1 |> @groupby(_.var1) |> @mutate(lag2 = lag(_.var2,2)) |> DataFrame
ERROR: MethodError: no method matching merge(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}, ::NamedTuple{(:lag2,),Tuple{ShiftedArray{Int64,Missing,1,QueryOperators.GroupColumnArrayView{Int64,Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},:var2}}}})
Closest candidates are:
merge(::NamedTuple{,T} where T<:Tuple, ::NamedTuple) at namedtuple.jl:245
merge(::NamedTuple{an,T} where T<:Tuple, ::NamedTuple{bn,T} where T<:Tuple) where {an, bn} at namedtuple.jl:233
merge(::NamedTuple, ::NamedTuple, ::NamedTuple...) at namedtuple.jl:249
...
Stacktrace:
[1] (::var"#437#442")(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}) at /Users/kayvon/.julia/packages/Query/AwBtd/src/query_translation.jl:58
[2] iterate at /Users/kayvon/.julia/packages/QueryOperators/g4G21/src/enumerable/enumerable_map.jl:25 [inlined]
[3] iterate at /Users/kayvon/.julia/packages/Tables/TjjiP/src/tofromdatavalues.jl:45 [inlined]
[4] buildcolumns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:185 [inlined]
[5] columns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:237 [inlined]
[6] #DataFrame#453(::Bool, ::Type{DataFrame}, ::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:40
[7] DataFrame(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:31
[8] |>(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}, ::Type) at ./operators.jl:854
[9] top-level scope at none:0
Any help with this approach, or with an alternative, would be appreciated. Thanks.
You definitely have the right idea. I don't use Query.jl, but this is easy to do with basic DataFrames syntax:
julia> using DataFrames, ShiftedArrays
julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
var2=[0,1,2,3,0,1,2,3]);
julia> by(df1, :var1, var2_l2 = :var2 => Base.Fix2(lag, 2))
8×2 DataFrame
│ Row │ var1   │ var2_l2 │
│     │ String │ Int64?  │
├─────┼────────┼─────────┤
│ 1   │ a      │ missing │
│ 2   │ a      │ missing │
│ 3   │ a      │ 0       │
│ 4   │ a      │ 1       │
│ 5   │ b      │ missing │
│ 6   │ b      │ missing │
│ 7   │ b      │ 0       │
│ 8   │ b      │ 1       │
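Note that I used Base.Fix2 here to get a single-argument version of lag. This is essentially the same as defining your own l2(x) = lag(x, 2) and then using l2 in the by call. If you do define your own l2 function, you can also set a default value, for example l2(x) = lag(x, 2, default = -1000), if you want to avoid missing values: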
julia> l2(x) = lag(x, 2, default = -1000)
l2 (generic function with 1 method)
julia> by(df1, :var1, var2_l2 = :var2 => l2)
8×2 DataFrame
│ Row │ var1   │ var2_l2 │
│     │ String │ Int64   │
├─────┼────────┼─────────┤
│ 1   │ a      │ -1000   │
│ 2   │ a      │ -1000   │
│ 3   │ a      │ 0       │
│ 4   │ a      │ 1       │
│ 5   │ b      │ -1000   │
│ 6   │ b      │ -1000   │
│ 7   │ b      │ 0       │
│ 8   │ b      │ 1       │
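The equivalence between Base.Fix2(lag, 2) and a hand-written l2 described above can be checked on a plain vector, without any DataFrame involved. This is only a minimal sketch, assuming ShiftedArrays.jl is installed; the names v, l2_fix, l2_fun, l2_default, same and filled are illustrative and not part of any library.

```julia
using ShiftedArrays: lag   # explicit import; works whether or not `lag` is exported

v = [0, 1, 2, 3]

# Base.Fix2(f, n) returns a callable that behaves like x -> f(x, n).
l2_fix = Base.Fix2(lag, 2)
l2_fun(x) = lag(x, 2)

# Both produce the same lagged view; collect materializes it into a Vector.
same = isequal(collect(l2_fix(v)), collect(l2_fun(v)))

# Fix2 only fixes a positional argument, so a default needs a small wrapper.
l2_default(x) = lag(x, 2; default = -1000)
filled = collect(l2_default(v))   # [-1000, -1000, 0, 1]
```

isequal is used rather than == because comparing vectors that contain missing with == returns missing instead of true.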
Under DataFrames.jl 0.22.2, the correct syntax is:
julia> combine(groupby(df1, :var1), :var2 => Base.Fix2(lag, 2) => :var2_l2)
8×2 DataFrame
 Row │ var1    var2_l2
     │ String  Int64?
─────┼─────────────────
   1 │ a       missing
   2 │ a       missing
   3 │ a             0
   4 │ a             1
   5 │ b       missing
   6 │ b       missing
   7 │ b             0
   8 │ b             1
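The question asked for a new lag2 column alongside the existing data, whereas combine returns only the grouping column and the result. One possible variation, sketched here on the assumption that DataFrames.jl 0.22 or later and ShiftedArrays.jl are available, is to use transform on the grouped frame instead. The names df2 and df3 are illustrative.

```julia
using DataFrames
using ShiftedArrays: lag

df1 = DataFrame(var1 = ["a", "a", "a", "a", "b", "b", "b", "b"],
                var2 = [0, 1, 2, 3, 0, 1, 2, 3])

# transform keeps the existing columns and the original row order,
# and appends the per-group result as a new column.
df2 = transform(groupby(df1, :var1), :var2 => (x -> lag(x, 2)) => :lag2)

# Same idea with a default value of 0 instead of missing.
df3 = transform(groupby(df1, :var1),
                :var2 => (x -> lag(x, 2; default = 0)) => :lag2)
```

With the default variant the lag2 column stays a plain integer column, which may be convenient if later steps cannot handle missing values.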