Tags: memory, r, data.table
I am writing an R package to analyse high-throughput animal behaviour data. The data are multivariate time series. I chose to use data.tables to represent them and I find them very convenient.
For one animal, I would have something like:
one_animal_dt <- data.table(t=1:20, x=rnorm(20), y=rnorm(20))
However, my users and I work with many animals, which come with different arbitrary treatments, conditions and other variables that are constant within each animal.
In the end, I found that the most convenient way to represent the data was to combine the behaviour of all animals from all experiments in a single data table, with extra columns for these "repeated variables", which I set as the key.
So, conceptually, something like:
animal_list <- list()
animal_list[[1]] <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
treatment="A", date="2017-02-21 20:00:00",
animal_id=1)
animal_list[[2]] <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
treatment="B", date="2017-02-21 22:00:00",
animal_id=2)
# ...
final_dt <- rbindlist(animal_list)
setkeyv(final_dt,c("treatment", "date","animal_id"))
This way, it is very convenient to compute per-animal summaries while staying agnostic about the biological information (treatment, etc.).
In practice, we have millions (rather than 20) of consecutive reads per animal, so the columns added for convenience contain highly repetitive values, which is not memory efficient.
Is there a way to compress this highly redundant key without losing the structure (i.e. the columns) of the table? Ideally, I would rather not force my users to perform joins themselves.
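For example, the per-animal summaries mentioned above can be computed by grouping on the key columns (a sketch; the names mean_x and mean_y are illustrative):

# per-animal summary, agnostic of the biological columns;
# key(final_dt) returns c("treatment", "date", "animal_id")
final_dt[, .(mean_x = mean(x), mean_y = mean(y)), by = key(final_dt)]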
Let's assume we are database administrators tasked with implementing this efficiently in a SQL database. One of the goals of database normalization is the reduction of redundancy.
According to the OP's description, there are many (about 1 M) observations per animal (multivariate, longitudinal data), while the number of animals seems to be much smaller.
So, the constant (or invariant) base data of each animal, e.g., treatment and date, should be kept separately from the observations. animal_id is the key into both tables, assuming animal_id is unique (as the name suggests). (Note that this is the main difference to Mallick's answer, which uses treatment as key; treatment is not guaranteed to be unique, i.e., two animals may receive the same treatment, and using it furthermore increases redundancy.)
For the purpose of demonstration, more realistic "benchmark" data are being created for 10 animals with 1 M observations for each animal:
library(data.table) # CRAN version 1.10.4 used
# create observations
n_obs <- 1E6L
n_animals <- 10L
set.seed(123L)
observations <- data.table(
animal_id = rep(seq_len(n_animals), each = n_obs),
t = rep(seq_len(n_obs), n_animals),
x = rnorm(n_animals * n_obs),
y = rnorm(n_animals * n_obs))
# create animal base data
animals <- data.table(
animal_id = seq_len(n_animals),
treatment = wakefield::string(n_animals),
date = wakefield::date_stamp(n_animals, random = TRUE))
Here the wakefield package is used to create dummy names and dates. Note that animal_id is of type integer.
> str(observations)
Classes ‘data.table’ and 'data.frame':  10000000 obs. of  4 variables:
 $ animal_id: int  1 1 1 1 1 1 1 1 1 1 ...
 $ t        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ x        : num  -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
 $ y        : num  0.696 -0.537 -3.043 1.849 -1.085 ...
 - attr(*, ".internal.selfref")=<externalptr>
> str(animals)
Classes ‘data.table’ and 'data.frame':  10 obs. of  3 variables:
 $ animal_id: int  1 2 3 4 5 6 7 8 9 10
 $ treatment: Classes 'variable', 'character'  atomic [1:10] MADxZ9c6fN ymoJHnvrRx ifdtywJ4jU Q7ZRwnQCsU ...
  .. ..- attr(*, "varname")= chr "String"
 $ date     : variable, format: "2017-07-02" "2016-10-02" ...
 - attr(*, ".internal.selfref")=<externalptr>
The combined size is about 240 Mbytes:
> object.size(observations)
240001568 bytes
> object.size(animals)
3280 bytes
Let's take this as a reference and compare it with the OP's approach, final_dt:
# join both tables to create equivalent of final_dt
joined <- animals[observations, on = "animal_id"]
The size has now nearly doubled (400 Mbytes), which is not memory efficient.
> object.size(joined)
400003432 bytes
Note that no data.table key has been set so far. Instead, the on parameter was used to specify the column to join on. If we set the keys, the joins will be sped up and the on parameter can be omitted:
setkey(observations, animal_id)
setkey(animals, animal_id)
joined <- animals[observations]
Now, we have demonstrated that it is memory efficient to use two separate tables.
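A quick check along the lines of the measurements above (a sketch) makes the saving explicit:

# footprint of the two normalized tables vs. the denormalized join
object.size(observations) + object.size(animals)   # about 240 Mbytes
object.size(joined)                                # about 400 Mbytes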
For subsequent analysis, we can aggregate the observations per animal, e.g.,
observations[, .(.N, mean(x), mean(y)), by = animal_id]
    animal_id       N            V2            V3
 1:         1 1000000 -5.214370e-04 -0.0019643145
 2:         2 1000000 -1.555513e-03  0.0002489457
 3:         3 1000000  1.541233e-06 -0.0005317967
 4:         4 1000000  1.775802e-04  0.0016212182
 5:         5 1000000 -9.026074e-04  0.0015266330
 6:         6 1000000 -1.000892e-03  0.0003284044
 7:         7 1000000  1.770055e-04 -0.0018654386
 8:         8 1000000  1.919562e-03  0.0008605261
 9:         9 1000000  1.175696e-03  0.0005042170
10:        10 1000000  1.681614e-03  0.0020562628
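As an aside (my sketch), the aggregate columns can also be named directly in the j expression, which avoids the generic V2/V3 headers seen above:

# same aggregation, but with explicit column names
observations[, .(N = .N, mean_x = mean(x), mean_y = mean(y)), by = animal_id]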
and join the aggregates with animals
animals[observations[, .(.N, mean(x), mean(y)), by = animal_id]]
    animal_id  treatment       date       N            V2            V3
 1:         1 MADxZ9c6fN 2017-07-02 1000000 -5.214370e-04 -0.0019643145
 2:         2 ymoJHnvrRx 2016-10-02 1000000 -1.555513e-03  0.0002489457
 3:         3 ifdtywJ4jU 2016-10-02 1000000  1.541233e-06 -0.0005317967
 4:         4 Q7ZRwnQCsU 2017-02-02 1000000  1.775802e-04  0.0016212182
 5:         5 H2M4V9Dfxz 2017-04-02 1000000 -9.026074e-04  0.0015266330
 6:         6 29P3hFxqNY 2017-03-02 1000000 -1.000892e-03  0.0003284044
 7:         7 rBxjewyGML 2017-02-02 1000000  1.770055e-04 -0.0018654386
 8:         8 gQP8cZhcTT 2017-04-02 1000000  1.919562e-03  0.0008605261
 9:         9 0GEOseSshh 2017-07-02 1000000  1.175696e-03  0.0005042170
10:        10 x74yDs2MdT 2017-02-02 1000000  1.681614e-03  0.0020562628
The OP has pointed out that he doesn't want to force his users to use joins themselves. Admittedly, typing animals[observations] takes more keystrokes than final_dt. So, it's up to the OP to decide whether the memory saving is worth this, or not.
This result can be filtered, for instance, if we want to compare animals with certain characteristics, e.g.,
animals[observations[, .(.N, mean(x), mean(y)), by = animal_id]][date == as.Date("2017-07-02")]
   animal_id  treatment       date       N           V2           V3
1:         1 MADxZ9c6fN 2017-07-02 1000000 -0.000521437 -0.001964315
2:         9 0GEOseSshh 2017-07-02 1000000  0.001175696  0.000504217
In this comment, the OP has described some use cases which he wants to see implemented transparently for his users:

- final_dt[, x2 := 1 - x]: As only observations are involved, this translates directly to observations[, x2 := 1 - x].
- Select using various criteria, final_dt[t > 5 & treatment == "A"]: Here, columns of both tables are involved. This can be implemented with data.table in different ways (note that the conditions have been amended for the actual sample data):
animals[observations][t < 5L & treatment %like% "MAD"]
This is analogous to the expected syntax but is slower than the alternative below, because here the filter conditions are applied to all rows of the full join.
The faster alternative is to split up the filter conditions: observations is filtered before the join to reduce the intermediate result, and the filter conditions on the base data columns are applied afterwards:
animals[observations[t < 5L]][treatment %like% "MAD"]
Note that this looks quite similar to the expected syntax (with one keystroke less).
If this is deemed unacceptable by the users, the join operation can be hidden in a function:
# function definition
filter_dt <- function(ani_filter = "", obs_filter = "") {
eval(parse(text = stringr::str_interp(
'animals[observations[${obs_filter}]][${ani_filter}]')))
}
# called by user
filter_dt("treatment %like% 'MAD'", "t < 5L")
   animal_id  treatment       date t           x          y
1:         1 MADxZ9c6fN 2017-07-02 1 -0.56047565  0.6958622
2:         1 MADxZ9c6fN 2017-07-02 2 -0.23017749 -0.5373377
3:         1 MADxZ9c6fN 2017-07-02 3  1.55870831 -3.0425688
4:         1 MADxZ9c6fN 2017-07-02 4  0.07050839  1.8488057
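If passing filters as strings feels awkward, a variant (my sketch; filter_dt2 is a hypothetical name, untested beyond simple cases) can capture unquoted expressions with substitute() and let data.table evaluate them in i:

# capture the unevaluated filter expressions and evaluate each
# within the scope of the respective table
filter_dt2 <- function(ani_filter = TRUE, obs_filter = TRUE) {
  obs <- substitute(obs_filter)
  ani <- substitute(ani_filter)
  animals[observations[eval(obs)]][eval(ani)]
}
# called by user, without quoting the conditions
filter_dt2(treatment %like% "MAD", t < 5L)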
Caveat: Your mileage may vary as the conclusions below depend on the internal representation of integers on your computer and the cardinality of the data. Please, see Matt Dowle's excellent answer concerning this subject.
Mallick has mentioned that memory might get wasted if integers are incidentally stored as numerics. This can be demonstrated:
n <- 10000L
# integer vs numeric vs logical
test_obj_size <- data.table(
rep(1, n),
rep(1L, n),
rep(TRUE, n))
str(test_obj_size)
Classes ‘data.table’ and 'data.frame':  10000 obs. of  3 variables:
 $ V1: num  1 1 1 1 1 1 1 1 1 1 ...
 $ V2: int  1 1 1 1 1 1 1 1 1 1 ...
 $ V3: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3
80040 40040 40040
Note that the numeric vector needs twice as much memory as the integer vector. Therefore, it is good programming practice to always qualify an integer constant with the suffix character L.
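Applied to the data at hand (a sketch): if a whole-number column had accidentally been created as numeric, an in-place conversion would halve its footprint:

# hypothetical repair; t is already integer in the data created above
observations[, t := as.integer(t)]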
Also the memory consumption of character strings can be reduced if they are coerced to factor:
# character vs factor
test_obj_size <- data.table(
rep("A", n),
rep("AAAAAAAAAAA", n),
rep_len(LETTERS, n),
factor(rep("A", n)),
factor(rep("AAAAAAAAAAA", n)),
factor(rep_len(LETTERS, n)))
str(test_obj_size)
Classes ‘data.table’ and 'data.frame':  10000 obs. of  6 variables:
 $ V1: chr  "A" "A" "A" "A" ...
 $ V2: chr  "AAAAAAAAAAA" "AAAAAAAAAAA" "AAAAAAAAAAA" "AAAAAAAAAAA" ...
 $ V3: chr  "A" "B" "C" "D" ...
 $ V4: Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
 $ V5: Factor w/ 1 level "AAAAAAAAAAA": 1 1 1 1 1 1 1 1 1 1 ...
 $ V6: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3    V4    V5    V6
80088 80096 81288 40456 40464 41856
Stored as factor, only half of the memory is required. This is because a character column stores an 8-byte pointer into R's global string cache for every row, whereas a factor stores a 4-byte integer code per row.
The same holds for the Date and POSIXct classes:
# Date & POSIXct vs factor
test_obj_size <- data.table(
rep(as.Date(Sys.time()), n),
rep(as.POSIXct(Sys.time()), n),
factor(rep(as.Date(Sys.time()), n)),
factor(rep(as.POSIXct(Sys.time()), n)))
str(test_obj_size)
Classes ‘data.table’ and 'data.frame':  10000 obs. of  4 variables:
 $ V1: Date, format: "2017-08-02" "2017-08-02" "2017-08-02" "2017-08-02" ...
 $ V2: POSIXct, format: "2017-08-02 18:25:55" "2017-08-02 18:25:55" "2017-08-02 18:25:55" "2017-08-02 18:25:55" ...
 $ V3: Factor w/ 1 level "2017-08-02": 1 1 1 1 1 1 1 1 1 1 ...
 $ V4: Factor w/ 1 level "2017-08-02 18:25:55": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3    V4
80248 80304 40464 40480
Note that data.table() refuses to create a column of class POSIXlt as it is stored in 40 bytes instead of 8 bytes.
So, if your application is memory critical, it might be worthwhile to consider using factors where applicable.
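For example (a sketch based on the joined table created earlier), the highly repetitive character column can be converted in place and the effect measured:

joined[, treatment := factor(treatment)]   # 8-byte pointers -> 4-byte codes
object.size(joined)                        # noticeably below the 400 Mbytes above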
You should consider using a nested data.frame:
library(tidyverse)
Using a toy example where I rbind 4 copies of mtcars:
new <- rbind(mtcars,mtcars,mtcars,mtcars) %>%
select(cyl,mpg)
object.size(new)
11384 bytes
If we group the data (as you might do to summarise values), the size increases a little:
grp <- rbind(mtcars,mtcars,mtcars,mtcars)%>%
select(cyl,mpg) %>%
group_by(cyl)
object.size(grp)
14272 bytes
If we also nest the data:
alt <- rbind(mtcars,mtcars,mtcars,mtcars) %>%
select(cyl,mpg) %>%
group_by(cyl) %>%
nest(mpg)
object.size(alt)
4360 bytes
you significantly reduce the object size.
Note that in this case you need many repeated values for nesting to save memory; for example, a nested version of a single copy of mtcars takes more memory than a regular single copy of mtcars.
----- Your case -----
alt1 <- final_dt %>%
group_by(animal_id, treatment, date) %>%
nest()
This looks like:
alt1
  animal_id treatment                date              data
1         1         A 2017-02-21 20:00:00 <tibble [20 x 3]>
2         2         B 2017-02-21 22:00:00 <tibble [20 x 3]>
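The nested column can still be worked with directly; for instance (a sketch, assuming purrr and tidyr are attached via the tidyverse):

# per-animal summaries computed on the list column
alt1 %>%
  mutate(mean_x = purrr::map_dbl(data, ~ mean(.x$x)),
         mean_y = purrr::map_dbl(data, ~ mean(.x$y)))
# the flat table can be recovered again (the exact unnest() signature
# depends on your tidyr version)
alt1 %>% tidyr::unnest()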