Abh*_*bhi 7 r sparse-matrix dataframe
I have a data frame that is mostly zeros (a sparse data frame?), something like
name,factor_1,factor_2,factor_3
ABC,1,0,0
DEF,0,1,0
GHI,0,0,1
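The actual data has about 90,000 rows and 10,000 features. Can I convert it to a sparse matrix? I expect to gain time and space efficiency by using a sparse matrix instead of a data frame.
Any help would be appreciated.
Update #1: Here is some code to generate the data frame. Thanks to Richard for providing it.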
x <- structure(list(name = structure(1:3, .Label = c("ABC", "DEF", "GHI"),
class = "factor"),
factor_1 = c(1L, 0L, 0L),
factor_2 = c(0L,1L, 0L),
factor_3 = c(0L, 0L, 1L)),
.Names = c("name", "factor_1","factor_2", "factor_3"),
class = "data.frame",
row.names = c(NA,-3L))
To avoid copying all the data into a dense matrix, this may be more memory efficient (but slower):
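library(Matrix)
y <- Reduce(cbind2, lapply(x[,-1], Matrix, sparse = TRUE)) # one sparse column per variable, bound column-wise
rownames(y) <- as.character(x[,1]) # name is a factor, so use its labels
y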
#3 x 3 sparse Matrix of class "dgCMatrix"
#
#ABC 1 . .
#DEF . 1 .
#GHI . . 1
If you have enough memory, you should use Richard's answer, i.e. turn your data.frame into a dense matrix and then call Matrix on it.
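If neither option is comfortable, another possibility is to record only the positions of the nonzero entries, column by column, and pass them to sparseMatrix() from the Matrix package. This is a minimal sketch rather than something taken from the answers here: the names num, rows, i, j, v and m are made up for illustration, and NA entries are silently skipped because which() drops them.

```r
library(Matrix)

# Same toy data as in the question (name kept as character here)
x <- data.frame(name = c("ABC", "DEF", "GHI"),
                factor_1 = c(1L, 0L, 0L),
                factor_2 = c(0L, 1L, 0L),
                factor_3 = c(0L, 0L, 1L),
                stringsAsFactors = FALSE)

num <- x[-1]  # numeric columns only

# Row indices of the nonzero entries in each column (NAs are dropped by which())
rows <- lapply(num, function(col) which(col != 0))

i <- unlist(rows, use.names = FALSE)          # row index of every nonzero cell
j <- rep(seq_along(rows), lengths(rows))      # matching column index
v <- unlist(Map(function(col, r) col[r], num, rows), use.names = FALSE)  # the values

m <- sparseMatrix(i = i, j = j, x = as.numeric(v),
                  dims = c(nrow(num), ncol(num)),
                  dimnames = list(as.character(x$name), names(num)))
m
```

Only one column is inspected at a time, so no dense 90,000 x 10,000 matrix is ever allocated. The loop over columns is the price paid for that.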
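I do this all the time and it's a pain in the butt, so I wrote a method called sparsify() in my R package, mltools. It operates on data.tables, which are just fancy data.frames.
To solve your specific problem...
Install mltools (or just copy the sparsify() method into your environment)
Load the packages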
library(data.table)
library(Matrix)
library(mltools)
Sparsify
x <- data.table(x) # convert x to a data.table
sparseM <- sparsify(x[, !"name"]) # sparsify everything except the name column
rownames(sparseM) <- x$name # set the rownames
> sparseM
3 x 3 sparse Matrix of class "dgCMatrix"
    factor_1 factor_2 factor_3
ABC        1        .        .
DEF        .        1        .
GHI        .        .        1
In general, the sparsify() method is pretty flexible. Here are some examples of how it can be used:
Make some data. Note the data types and the unused factor levels
dt <- data.table(
intCol=c(1L, NA_integer_, 3L, 0L),
realCol=c(NA, 2, NA, NA),
logCol=c(TRUE, FALSE, TRUE, FALSE),
ofCol=factor(c("a", "b", NA, "b"), levels=c("a", "b", "c"), ordered=TRUE),
ufCol=factor(c("a", NA, "c", "b"), ordered=FALSE)
)
> dt
   intCol realCol logCol ofCol ufCol
1:      1      NA   TRUE     a     a
2:     NA       2  FALSE     b    NA
3:      3      NA   TRUE    NA     c
4:      0      NA  FALSE     b     b
Out of the box
> sparsify(dt)
4 x 7 sparse Matrix of class "dgCMatrix"
     intCol realCol logCol ofCol ufCol_a ufCol_b ufCol_c
[1,]      1      NA      1     1       1       .       .
[2,]     NA       2      .     2      NA      NA      NA
[3,]      3      NA      1    NA       .       .       1
[4,]      .      NA      .     2       .       1       .
Convert NAs to 0s and sparsify them
> sparsify(dt, sparsifyNAs=TRUE)
4 x 7 sparse Matrix of class "dgCMatrix"
     intCol realCol logCol ofCol ufCol_a ufCol_b ufCol_c
[1,]      1       .      1     1       1       .       .
[2,]      .       2      .     2       .       .       .
[3,]      3       .      1     .       .       .       1
[4,]      .       .      .     2       .       1       .
Generate columns that identify NA values
> sparsify(dt[, list(realCol)], naCols="identify")
4 x 2 sparse Matrix of class "dgCMatrix"
     realCol_NA realCol
[1,]          1      NA
[2,]          .       2
[3,]          1      NA
[4,]          1      NA
Generate columns that identify NA values in the most memory-efficient manner
> sparsify(dt[, list(realCol)], naCols="efficient")
4 x 2 sparse Matrix of class "dgCMatrix"
     realCol_NotNA realCol
[1,]             .      NA
[2,]             1       2
[3,]             .      NA
[4,]             .      NA
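The reason a "not NA" flag can be cheaper than an "is NA" flag is that a sparse matrix only stores its nonzero entries. The sketch below illustrates that idea with the Matrix package alone. It does not call mltools and is only an assumption about why the realCol_NotNA column was chosen in the output above; the names na_flag, not_na_flag, n_na and n_not_na are hypothetical.

```r
library(Matrix)

realCol <- c(NA, 2, NA, NA)  # same values as dt$realCol above

# Indicator columns stored as one-column sparse matrices
na_flag     <- Matrix(as.numeric(is.na(realCol)),  sparse = TRUE)
not_na_flag <- Matrix(as.numeric(!is.na(realCol)), sparse = TRUE)

# Number of entries each representation has to store explicitly
n_na     <- nnzero(na_flag)
n_not_na <- nnzero(not_na_flag)

c(is_NA = n_na, not_NA = n_not_na)
```

When most values in a column are NA, the "not NA" indicator has fewer nonzero entries and therefore takes less space.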
You can set the first column as the row names, then use Matrix from the Matrix package.
rownames(x) <- x$name
x <- x[-1]
library(Matrix)
Matrix(as.matrix(x), sparse = TRUE)
# 3 x 3 sparse Matrix of class "dtCMatrix"
#     factor_1 factor_2 factor_3
# ABC        1        .        .
# DEF        .        1        .
# GHI        .        .        1
where x is the original data frame:
x <- structure(list(name = structure(1:3, .Label = c("ABC", "DEF",
"GHI"), class = "factor"), factor_1 = c(1L, 0L, 0L), factor_2 = c(0L,
1L, 0L), factor_3 = c(0L, 0L, 1L)), .Names = c("name", "factor_1",
"factor_2", "factor_3"), class = "data.frame", row.names = c(NA,
-3L))
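To check whether the conversion pays off for your own data, you can compare object.size() before and after. A rough sketch on simulated data follows; the dimensions and the 5,000 nonzero cells are arbitrary assumptions, far smaller than the 90,000 x 10,000 case in the question, and the variable names are made up for illustration.

```r
library(Matrix)

set.seed(1)
n_row <- 2000
n_col <- 500

# Dense integer matrix that is almost entirely zero
dense <- matrix(0L, nrow = n_row, ncol = n_col)
dense[sample(length(dense), 5000)] <- 1L  # 5,000 distinct cells set to 1

# Same data as a sparse matrix
sparse <- Matrix(dense, sparse = TRUE)

dense_bytes  <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))

c(dense = dense_bytes, sparse = sparse_bytes)
```

The saving grows with the share of zeros. For data that is not mostly zero, the sparse form can end up larger than the dense one, because every stored value also carries its row index.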