StatsModels.jl:Julia 中的指数公式 `formula(y ~ (x1 + x2 + x3 + x4 + x5)^2`

Jho*_*jas 5 r julia

我一直在尝试复制这个矩阵(R )Julia

\n
data = read.csv("https://gist.githubusercontent.com/TJhon/158daa0c2dd06010d01a72dae2af8314/raw/61df065c98ec90b9ea3b8598d1996fb5371a64aa/rnd.csv")\n\nhead(model.matrix(y ~ (x1 + x2 + x3 + x4 + x5)^2, data), 3)\n#>   (Intercept)         x1         x2         x3        x4 x5       x1:x2\n#> 1           1 -0.3007225 -1.3710894  0.3423409  1.322547  2  0.41231744\n#> 2           1  0.4674170  0.8728939  0.9534157 -1.007083  1  0.40800548\n#> 3           1  0.2085316 -0.3657995 -0.3043694 -1.036938  4 -0.07628076\n#>         x1:x3      x1:x4      x1:x5      x2:x3      x2:x4      x2:x5      x3:x4\n#> 1 -0.10294961 -0.3977198 -0.6014450 -0.4693799 -1.8133307 -2.7421787  0.4527620\n#> 2  0.44564276 -0.4707279  0.4674170  0.8322308 -0.8790769  0.8728939 -0.9601690\n#> 3 -0.06347064 -0.2162343  0.8341265  0.1113382  0.3793113 -1.4631979  0.3156121\n#>        x3:x5     x4:x5\n#> 1  0.6846817  2.645095\n#> 2  0.9534157 -1.007083\n#> 3 -1.2174775 -4.147751\n
Run Code Online (Sandbox Code Playgroud)\n

由reprex 包于 2022 年 10 月 18 日创建(v2.0.1)

\n

我尝试过

\n
using CSV, DataFrames, StatsModels, StatsBase\n\ndata = CSV.read(download("https://gist.githubusercontent.com/TJhon/158daa0c2dd06010d01a72dae2af8314/raw/61df065c98ec90b9ea3b8598d1996fb5371a64aa/rnd.csv"), DataFrame) \n\nModelMatrix(ModelFrame(@formula(y ~ (x1 + x2 + x3 + x4 + x5) * (x1 + x2 + x3 + x4 + x5)), data)).m\n\n9\xc3\x9731 Matrix{Float64}:\n 1.0  -0.300723  -1.37109    0.342341   1.32255   2.0  0.090434    0.412317   -0.10295    -0.39772    \xe2\x80\xa6   0.452762    1.74913     2.64509   -0.601445  -2.74218    0.684682   2.64509    4.0\n 1.0   0.467417   0.872894   0.953416  -1.00708   1.0  0.218479    0.408005    0.445643   -0.470728      -0.960169    1.01422    -1.00708    0.467417   0.872894   0.953416  -1.00708    1.0\n\n 1.0   0.395908  -1.15159   -0.204683  -0.207952  2.0  0.156743   -0.455924   -0.0810355  -0.0823297      0.0425641   0.0432439  -0.415903   0.791816  -2.30318   -0.409365  -0.415903   4.0\n
Run Code Online (Sandbox Code Playgroud)\n
ModelMatrix(ModelFrame(@formula(y ~ (x1 + x2 + x3 + x4 + x5) & (x1 + x2 + x3 + x4 + x5)), data)).m\n\n9\xc3\x9726 Matrix{Float64}:\n 1.0  0.090434    0.412317   -0.10295    -0.39772    -0.601445   0.412317   1.87989   -0.46938   -1.81333    \xe2\x80\xa6   0.452762    1.74913     2.64509   -0.601445  -2.74218    0.684682   2.64509    4.0   \n 1.0  0.218479    0.408005    0.445643   -0.470728    0.467417   0.408005   0.761944   0.832231  -0.879077      -0.960169    1.01422    -1.00708    0.467417   0.872894   0.953416  \n\n 1.0  0.156743   -0.455924   -0.0810355  -0.0823297   0.791816  -0.455924   1.32616    0.235711   0.239475       0.0425641   0.0432439  -0.415903   0.791816  -2.30318   -0.409365  -0.415903   4.0   \n
Run Code Online (Sandbox Code Playgroud)\n
ModelMatrix(ModelFrame(@formula(y ~ (x1 + x2 + x3 + x4 + x5)^2), data)).m\n\n9\xc3\x972 Matrix{Float64}:\n 1.0   3.97235\n 1.0   5.22874\n :     :\n 1.0   0.691696\n
Run Code Online (Sandbox Code Playgroud)\n

我想要相同的变量名数组和向量稍后将其转换为数据帧。

\n

Dan*_*etz 3

xs = term.((Symbol("x$i") for i=1:5))
ff = vcat(term(1), xs, [xs[a] & xs[b] for a in 1:5 for b in a+1:5])
ModelMatrix(ModelFrame(FormulaTerm(Term(:y),Tuple(ff)), data)).m
Run Code Online (Sandbox Code Playgroud)

可以工作,但是比 R 版本更难看。也许有更好的方法。

并且:

varnames = vcat("(intercept)", ["x$i" for i=1:5], ["x$(a):x$(b)" for a in 1:5 for b in a+1:5])
Run Code Online (Sandbox Code Playgroud)

更新

上面的解决方案(在工作时)并不是特别好或朱利安,因此这里是一个试图更通用的重写:

using CSV, DataFrames, StatsModels, StatsBase

URL = join([
  "https:","","gist.githubusercontent.com",
  "TJhon","158daa0c2dd06010d01a72dae2af8314",
  "raw","61df065c98ec90b9ea3b8598d1996fb5371a64aa","rnd.csv"],
  '/')

data = CSV.read(download(URL), DataFrame) 

using Combinatorics

subsets(X; from=0, upto=length(X)) =
  Iterators.flatten(combinations(X,i) for i=max(0,from):min(upto,length(X)))

xs = term.((Symbol("x$i") for i=1:5))

term_vec = collect(subsets(xs; from=1, upto=2))
rhs = vcat(ConstantTerm(1), map(x->reduce(&, x), term_vec))
rhs_names = vcat("(intercept)", [join(string.(x),'*') for x in term_vec])

ModelMatrix(ModelFrame(FormulaTerm(Term(:y),Tuple(rhs)), data)).m
Run Code Online (Sandbox Code Playgroud)

它看起来更长,但最初的部分只是为了方便复制粘贴,并且它的好处是只需在定义中替换upto=2with即可允许三项交互。upto=3term_vec

特别是,subsets迭代器在许多情况下应该非常有用,并且是添加到迭代器的好主意(如果同意,请评论)。