我一直在尝试复制这个矩阵(R
)Julia
data = read.csv("https://gist.githubusercontent.com/TJhon/158daa0c2dd06010d01a72dae2af8314/raw/61df065c98ec90b9ea3b8598d1996fb5371a64aa/rnd.csv")\n\nhead(model.matrix(y ~ (x1 + x2 + x3 + x4 + x5)^2, data), 3)\n#> (Intercept) x1 x2 x3 x4 x5 x1:x2\n#> 1 1 -0.3007225 -1.3710894 0.3423409 1.322547 2 0.41231744\n#> 2 1 0.4674170 0.8728939 0.9534157 -1.007083 1 0.40800548\n#> 3 1 0.2085316 -0.3657995 -0.3043694 -1.036938 4 -0.07628076\n#> x1:x3 x1:x4 x1:x5 x2:x3 x2:x4 x2:x5 x3:x4\n#> 1 -0.10294961 -0.3977198 -0.6014450 -0.4693799 -1.8133307 -2.7421787 0.4527620\n#> 2 0.44564276 -0.4707279 0.4674170 0.8322308 -0.8790769 0.8728939 -0.9601690\n#> 3 -0.06347064 -0.2162343 0.8341265 0.1113382 0.3793113 -1.4631979 0.3156121\n#> x3:x5 x4:x5\n#> 1 0.6846817 2.645095\n#> 2 0.9534157 -1.007083\n#> 3 -1.2174775 -4.147751\n
Run Code Online (Sandbox Code Playgroud)\n由reprex 包于 2022 年 10 月 18 日创建(v2.0.1)
\n我尝试过
\nusing CSV, DataFrames, StatsModels, StatsBase\n\ndata = CSV.read(download("https://gist.githubusercontent.com/TJhon/158daa0c2dd06010d01a72dae2af8314/raw/61df065c98ec90b9ea3b8598d1996fb5371a64aa/rnd.csv"), DataFrame) \n\nModelMatrix(ModelFrame(@formula(y ~ (x1 + x2 + x3 + x4 + x5) * (x1 + x2 + x3 + x4 + x5)), data)).m\n\n9\xc3\x9731 Matrix{Float64}:\n 1.0 -0.300723 -1.37109 0.342341 1.32255 2.0 0.090434 0.412317 -0.10295 -0.39772 \xe2\x80\xa6 0.452762 1.74913 2.64509 -0.601445 -2.74218 0.684682 2.64509 4.0\n 1.0 0.467417 0.872894 0.953416 -1.00708 1.0 0.218479 0.408005 0.445643 -0.470728 -0.960169 1.01422 -1.00708 0.467417 0.872894 0.953416 -1.00708 1.0\n\n 1.0 0.395908 -1.15159 -0.204683 -0.207952 2.0 0.156743 -0.455924 -0.0810355 -0.0823297 0.0425641 0.0432439 -0.415903 0.791816 -2.30318 -0.409365 -0.415903 4.0\n
Run Code Online (Sandbox Code Playgroud)\nModelMatrix(ModelFrame(@formula(y ~ (x1 + x2 + x3 + x4 + x5) & (x1 + x2 + x3 + x4 + x5)), data)).m\n\n9\xc3\x9726 Matrix{Float64}:\n 1.0 0.090434 0.412317 -0.10295 -0.39772 -0.601445 0.412317 1.87989 -0.46938 -1.81333 \xe2\x80\xa6 0.452762 1.74913 2.64509 -0.601445 -2.74218 0.684682 2.64509 4.0 \n 1.0 0.218479 0.408005 0.445643 -0.470728 0.467417 0.408005 0.761944 0.832231 -0.879077 -0.960169 1.01422 -1.00708 0.467417 0.872894 0.953416 \n\n 1.0 0.156743 -0.455924 -0.0810355 -0.0823297 0.791816 -0.455924 1.32616 0.235711 0.239475 0.0425641 0.0432439 -0.415903 0.791816 -2.30318 -0.409365 -0.415903 4.0 \n
Run Code Online (Sandbox Code Playgroud)\nModelMatrix(ModelFrame(@formula(y ~ (x1 + x2 + x3 + x4 + x5)^2), data)).m\n\n9\xc3\x972 Matrix{Float64}:\n 1.0 3.97235\n 1.0 5.22874\n : :\n 1.0 0.691696\n
Run Code Online (Sandbox Code Playgroud)\n我想要相同的变量名数组和向量稍后将其转换为数据帧。
\nxs = term.((Symbol("x$i") for i=1:5))
ff = vcat(term(1), xs, [xs[a] & xs[b] for a in 1:5 for b in a+1:5])
ModelMatrix(ModelFrame(FormulaTerm(Term(:y),Tuple(ff)), data)).m
Run Code Online (Sandbox Code Playgroud)
可以工作,但是比 R 版本更难看。也许有更好的方法。
并且:
varnames = vcat("(intercept)", ["x$i" for i=1:5], ["x$(a):x$(b)" for a in 1:5 for b in a+1:5])
Run Code Online (Sandbox Code Playgroud)
更新
上面的解决方案(在工作时)并不是特别好或朱利安,因此这里是一个试图更通用的重写:
using CSV, DataFrames, StatsModels, StatsBase
URL = join([
"https:","","gist.githubusercontent.com",
"TJhon","158daa0c2dd06010d01a72dae2af8314",
"raw","61df065c98ec90b9ea3b8598d1996fb5371a64aa","rnd.csv"],
'/')
data = CSV.read(download(URL), DataFrame)
using Combinatorics
subsets(X; from=0, upto=length(X)) =
Iterators.flatten(combinations(X,i) for i=max(0,from):min(upto,length(X)))
xs = term.((Symbol("x$i") for i=1:5))
term_vec = collect(subsets(xs; from=1, upto=2))
rhs = vcat(ConstantTerm(1), map(x->reduce(&, x), term_vec))
rhs_names = vcat("(intercept)", [join(string.(x),'*') for x in term_vec])
ModelMatrix(ModelFrame(FormulaTerm(Term(:y),Tuple(rhs)), data)).m
Run Code Online (Sandbox Code Playgroud)
它看起来更长,但最初的部分只是为了方便复制粘贴,并且它的好处是只需在定义中替换upto=2
with即可允许三项交互。upto=3
term_vec
特别是,subsets
迭代器在许多情况下应该非常有用,并且是添加到迭代器的好主意(如果同意,请评论)。