lubridate: equality of interval objects is evaluated incorrectly

tmf*_*mnk 1 r lubridate tidyverse

Suppose a df like this:

df <- data.frame(id = c(rep(1:5, each = 2)),
time1 = c("2008-10-12", "2008-08-10", "2006-01-09", "2008-03-13", "2008-09-12", "2007-05-30", "2003-09-29","2003-09-29", "2003-04-01", "2003-04-01"),
time2 = c("2009-03-20", "2009-06-15", "2006-02-13", "2008-04-17", "2008-10-17", "2007-07-04", "2004-01-15", "2004-01-15", "2003-07-04", "2003-07-04"))

   id      time1      time2
1   1 2008-10-12 2009-03-20
2   1 2008-08-10 2009-06-15
3   2 2006-01-09 2006-02-13
4   2 2008-03-13 2008-04-17
5   3 2008-09-12 2008-10-17
6   3 2007-05-30 2007-07-04
7   4 2003-09-29 2004-01-15
8   4 2003-09-29 2004-01-15
9   5 2003-04-01 2003-07-04
10  5 2003-04-01 2003-07-04

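What I am trying to do is, first, use lubridate to create an interval between the variables "time1" and "time2". Second, I want to group by "id" and check whether the next row is the same as the current row, and whether the previous row is the same as the current row. I can do it like this: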

library(tidyverse)
library(lubridate)  # interval() comes from lubridate, which older tidyverse versions do not attach

df %>%
 mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
 mutate(overlap = interval(time1, time2)) %>%
 group_by(id) %>%
 mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
        cond2 = ifelse(lag(overlap) == overlap, 1, 0))

      id time1      time2      overlap                        cond1 cond2
   <int> <date>     <date>     <S4: Interval>                 <dbl> <dbl>
 1     1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC     0    NA
 2     1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC    NA     0
 3     2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC     1    NA
 4     2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC    NA     1
 5     3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC     1    NA
 6     3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC    NA     1
 7     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC     1    NA
 8     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC    NA     1
 9     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC     1    NA
10     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC    NA     1

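As you can see, the problem is that for id == 2 and id == 3 both conditions evaluate to TRUE even though the intervals are not the same. For id == 1 it correctly evaluates to FALSE, and for id == 4 and id == 5 it correctly evaluates to TRUE.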

Now, when I convert the interval to character, it is evaluated correctly:

df %>%
 mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
 mutate(overlap = as.character(interval(time1, time2))) %>%
 group_by(id) %>%
 mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
        cond2 = ifelse(lag(overlap) == overlap, 1, 0)) 

      id time1      time2      overlap                        cond1 cond2
   <int> <date>     <date>     <chr>                          <dbl> <dbl>
 1     1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC     0    NA
 2     1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC    NA     0
 3     2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC     0    NA
 4     2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC    NA     0
 5     3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC     0    NA
 6     3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC    NA     0
 7     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC     1    NA
 8     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC    NA     1
 9     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC     1    NA
10     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC    NA     1

The question is: why does it evaluate some intervals as the same when they are not?

hmh*_*sen 7

I think this has to do with what lubridate actually computes.

When I compute the difference between time1 and time2, this happens:

df %>%
  mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
  mutate(overlap = time2 - time1)

   id      time1      time2  overlap
1   1 2008-10-12 2009-03-20 159 days
2   1 2008-08-10 2009-06-15 309 days
3   2 2006-01-09 2006-02-13  35 days
4   2 2008-03-13 2008-04-17  35 days
5   3 2008-09-12 2008-10-17  35 days
6   3 2007-05-30 2007-07-04  35 days
7   4 2003-09-29 2004-01-15 108 days
8   4 2003-09-29 2004-01-15 108 days
9   5 2003-04-01 2003-07-04  94 days
10  5 2003-04-01 2003-07-04  94 days

So we can see that, within id 2 and within id 3, the intervals have the same length in days.

Now, what does overlap actually compute? To find out, I changed your code slightly so that it reports the lead and lag values instead of 1.

df %>%
  mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
  mutate(overlap = interval(time1, time2)) %>%
  group_by(id) %>%
  mutate(cond1 = ifelse(lead(overlap) == overlap, lead(overlap), 0),
         cond2 = ifelse(lag(overlap) == overlap, lag(overlap), 0))

# A tibble: 10 x 6
# Groups:   id [5]
      id time1      time2      overlap                          cond1   cond2
   <int> <date>     <date>     <S4: Interval>                   <dbl>   <dbl>
 1     1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC       0      NA
 2     1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC      NA       0
 3     2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 3024000      NA
 4     2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC      NA 3024000
 5     3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 3024000      NA
 6     3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC      NA 3024000
 7     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 9331200      NA
 8     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC      NA 9331200
 9     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 8121600      NA
10     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC      NA 8121600

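Here we see that lead and lag are actually working on the length of each interval in seconds, rather than on its actual start and end dates. That explains why some intervals are considered equal even though their strings are not.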

Some more digging:

Let's look at the object produced by interval:

a <- interval(df$time1, df$time2)

str(a)
#Formal class 'Interval' [package "lubridate"] with 3 slots
#..@ .Data: num [1:10] 13737600 26697600 3024000 3024000 3024000 ...
#..@ start: POSIXct[1:10], format: "2008-10-12" "2008-08-10" "2006-01-09" ...
#..@ tzone: chr "UTC"

It is an S4 class with three slots: .Data, start and tzone.

Calling a displays the intervals.

a
 [1] 2008-10-12 UTC--2009-03-20 UTC 2008-08-10 UTC--2009-06-15 UTC 2006-01-09 UTC--2006-02-13 UTC
 [4] 2008-03-13 UTC--2008-04-17 UTC 2008-09-12 UTC--2008-10-17 UTC 2007-05-30 UTC--2007-07-04 UTC
 [7] 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC 2003-04-01 UTC--2003-07-04 UTC
[10] 2003-04-01 UTC--2003-07-04 UTC

However, when you do a calculation on a, it is done on .Data, which holds the length of each interval in seconds, counted from its start date (see ?interval).

a@.Data
#[1] 13737600 26697600  3024000  3024000  3024000  3024000  9331200  9331200  8121600  8121600
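
To make the relationship between the slots concrete, here is a minimal sketch on a single interval (row 3 of the data). It assumes only the slot layout shown by str() above; the variable names are illustrative and not part of lubridate.

```r
library(lubridate)

# One interval from the question (row 3): 35 days long
a3 <- interval(ymd("2006-01-09"), ymd("2006-02-13"))

# Three ways to get the length in seconds
len_slot   <- a3@.Data                                   # the raw slot
len_fun    <- int_length(a3)                             # lubridate accessor
len_manual <- as.numeric(difftime(int_end(a3), int_start(a3), units = "secs"))

# The end date is not stored; it can be rebuilt from start + length
end_rebuilt <- a3@start + a3@.Data
```

All three lengths should come out as 35 * 86400 seconds, and end_rebuilt should match int_end(a3), which is why two intervals with different dates can still carry the same .Data value.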

For the start dates of the intervals, we need to access the start slot.

a@start
#[1] "2008-10-12 UTC" "2008-08-10 UTC" "2006-01-09 UTC" "2008-03-13 UTC" "2008-09-12 UTC"
#[6] "2007-05-30 UTC" "2003-09-29 UTC" "2003-09-29 UTC" "2003-04-01 UTC" "2003-04-01 UTC"

And the time zone:

a@tzone
#[1] "UTC"

We can also look at how the elements relate to each other. The second-to-last and the last element have the same interval.

a[9] == a[10]
#[1] TRUE

They are identical objects.

identical(a[9], a[10])
#[1] TRUE

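But what is really being checked when you test elements for equality? Elements 3 and 4 have the same time difference but are different intervals. So when you check whether the lag/lead is equal, it returns TRUE. It should not, because their dates differ. Only when we check whether they are identical do we get the result we expect.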

a[3] == a[4]
#[1] TRUE

a[3]@.Data == a[4]@.Data
#[1] TRUE

identical(a[3], a[4])
#[1] FALSE

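So what is going on? What a[3] == a[4] really checks is a[3]@.Data == a[4]@.Data, that is, whether 3024000 equals 3024000, so it returns TRUE. identical(), on the other hand, checks all slots and finds that the objects are not the same, because their start slots differ.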
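
If an element-wise comparison is needed, one option is to compare the endpoints explicitly. The following is only a sketch: same_interval is a hypothetical helper written for this example, not a lubridate function, while int_start(), int_end() and int_length() are lubridate accessors.

```r
library(lubridate)

# Rows 3 and 4 of the data: equal length (35 days), different dates
x <- interval(ymd("2006-01-09"), ymd("2006-02-13"))
y <- interval(ymd("2008-03-13"), ymd("2008-04-17"))

# Hypothetical helper: TRUE only when both endpoints coincide (vectorised)
same_interval <- function(a, b) {
  int_start(a) == int_start(b) & int_end(a) == int_end(b)
}

same_length <- int_length(x) == int_length(y)  # compares the lengths only
same_dates  <- same_interval(x, y)             # compares start and end
same_self   <- same_interval(x, x)
```

Here same_length is TRUE while same_dates is FALSE, which mirrors the difference between == and identical() described above. Unlike identical(), the helper returns one value per element, so it can be applied to whole vectors of intervals.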

I then thought about using identical() together with lead/lag, so that we could build that logic into the code, but look at this:

a[9]
#[1] 2003-04-01 UTC--2003-07-04 UTC

# now lead
lead(a[9])
#2003-04-01 UTC--NA

The output does not look like a[10], as one might expect.

#now lag
lag(a[9])
#[1] NA
#attr(,"start")
#[1] "2003-04-01 UTC"
#attr(,"tzone")
#[1] "UTC"
#attr(,"class")
#[1] "Interval"
#attr(,"class")attr(,"package")
#[1] "lubridate"

So lead and lag treat S4 objects differently. To get a better handle on what your first attempt outputs, I did this:

df %>%
     mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
     mutate(overlap = interval(time1, time2)) %>%
     group_by(id) %>%
     mutate(cond1 = lead(overlap),
            cond2 = lag(overlap))

I get a lot of warning messages saying

#In mutate_impl(.data, dots) :
#  Vectorizing 'Interval' elements may not preserve their attributes
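
One way to sidestep the attribute loss is to keep the Interval object out of lead() and lag() altogether and shift the two date columns instead. The sketch below assumes the same data as in the question, with the dates parsed up front; res is just an illustrative name.

```r
library(dplyr)

# Same data as in the question, with the dates parsed up front
df <- data.frame(
  id = rep(1:5, each = 2),
  time1 = as.Date(c("2008-10-12", "2008-08-10", "2006-01-09", "2008-03-13",
                    "2008-09-12", "2007-05-30", "2003-09-29", "2003-09-29",
                    "2003-04-01", "2003-04-01")),
  time2 = as.Date(c("2009-03-20", "2009-06-15", "2006-02-13", "2008-04-17",
                    "2008-10-17", "2007-07-04", "2004-01-15", "2004-01-15",
                    "2003-07-04", "2003-07-04"))
)

# Two rows describe the same interval only if both endpoints match
res <- df %>%
  group_by(id) %>%
  mutate(
    cond1 = as.integer(lead(time1) == time1 & lead(time2) == time2),
    cond2 = as.integer(lag(time1) == time1 & lag(time2) == time2)
  ) %>%
  ungroup()
```

Because Date columns are plain vectors with a class attribute, lead() and lag() shift them without the warning, and cond1/cond2 come out as 1 only for id 4 and id 5, matching the as.character result in the question.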

I do not know enough about R objects to understand how data is stored in S4 classes, but it looks different from typical S3 objects.

Using as.character, as you did, seems to be the way to go.