I am working with the R programming language.
I have the following dataset ("df"):
df <- structure(list(student = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
var1 = c("a", "b", "b", "a", "c", "a", "b"),
start = structure(c(14610, 14610, 15869, 17439, 14610, 16436, 17897), class = "Date"),
end = structure(c(15706, 15706, 16679, 17723, 16071, 17492, 18791), class = "Date")),
row.names = c(NA, -7L), class = "data.frame")
student var1 start end
1 1 a 2010-01-01 2013-01-01
2 1 b 2010-01-01 2013-01-01
3 1 b 2013-06-13 2015-09-01
4 1 a 2017-09-30 2018-07-11
5 2 c 2010-01-01 2014-01-01
6 2 a 2015-01-01 2017-11-22
7 2 b 2019-01-01 2021-06-13
My question: using the sqldf library in R, I tried to run the following code to see, for each year-long window (from 2010-03-01 to 2020-03-01), whether each student has at least one row with var1 = 'a':
library(sqldf)
sqldf("WITH date_ranges AS (
SELECT '2010-03-01' AS start_date, '2011-03-01' AS end_date UNION ALL
SELECT '2011-03-01', '2012-03-01' UNION ALL
SELECT '2012-03-01', '2013-03-01' UNION ALL
SELECT '2013-03-01', '2014-03-01' UNION ALL
SELECT '2014-03-01', '2015-03-01' UNION ALL
SELECT '2015-03-01', '2016-03-01' UNION ALL
SELECT '2016-03-01', '2017-03-01' UNION ALL
SELECT '2017-03-01', '2018-03-01' UNION ALL
SELECT '2018-03-01', '2019-03-01' UNION ALL
SELECT '2019-03-01', '2020-03-01'
),
joined_data AS (
SELECT
t.student,
d.start_date,
d.end_date,
t.var1
FROM
df t
JOIN
date_ranges d
ON
t.start <= d.end_date AND t.end >= d.start_date
)
select * from joined_data;")
However, this returns an empty result (zero rows):
[1] student start_date end_date var1
<0 rows> (or 0-length row.names)
Can someone help me understand why this happens and what I can do to avoid it?
Thanks!
Here is the full code I am using to solve this problem:
sqldf("WITH date_ranges AS (
SELECT '2010-03-01' AS start_date, '2011-03-01' AS end_date UNION ALL
SELECT '2011-03-01', '2012-03-01' UNION ALL
SELECT '2012-03-01', '2013-03-01' UNION ALL
SELECT '2013-03-01', '2014-03-01' UNION ALL
SELECT '2014-03-01', '2015-03-01' UNION ALL
SELECT '2015-03-01', '2016-03-01' UNION ALL
SELECT '2016-03-01', '2017-03-01' UNION ALL
SELECT '2017-03-01', '2018-03-01' UNION ALL
SELECT '2018-03-01', '2019-03-01' UNION ALL
SELECT '2019-03-01', '2020-03-01'
),
joined_data AS (
SELECT
t.student,
d.start_date,
d.end_date,
t.var1
FROM
df t
JOIN
date_ranges d
ON
t.start <= d.end_date AND t.end >= d.start_date
), var1_counts AS (
SELECT
student,
start_date,
end_date,
COUNT(CASE WHEN var1 = 'a' THEN 1 END) AS var1_count
FROM
joined_data
GROUP BY
student, start_date, end_date
)
SELECT
student,
start_date,
end_date,
CASE WHEN var1_count > 0 THEN 'Yes' ELSE 'No' END AS at_least_one_var1_a
FROM
var1_counts;
select * from joined_data;")
Answer (score: 5)
Convert the start and end columns to character after creating df, and the code should work:
df$start <- as.character(df$start)
df$end <- as.character(df$end)
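The problem with the original code is that the start and end columns of df reach SQLite as the numbers stored inside the Date objects built by structure() (days since 1970-01-01), not as date strings, so date conversion and comparison fail. To verify: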
> df <- structure(list(student = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
+ var1 = c("a", "b", "b", "a", "c", "a", "b"),
+ start = structure(c(14610, 14610, 15869, 17439, 14610, 16436, 17897), class = "Date"),
+ end = structure(c(15706, 15706, 16679, 17723, 16071, 17492, 18791), class = "Date")),
+ row.names = c(NA, -7L), class = "data.frame")
> sqldf("SELECT
+ start,
+ STRFTIME('%s', start) AS start_strf,
+ end,
+ STRFTIME('%s', end) AS end_strf
+ FROM df")
start start_strf end end_strf
1 2010-01-01 -209604456000 2013-01-01 -209509761600
2 2010-01-01 -209604456000 2013-01-01 -209509761600
3 2013-06-13 -209495678400 2015-09-01 -209425694400
4 2017-09-30 -209360030400 2018-07-11 -209335492800
5 2010-01-01 -209604456000 2014-01-01 -209478225600
6 2015-01-01 -209446689600 2017-11-22 -209355451200
7 2019-01-01 -209320459200 2021-06-13 -209243217600
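We can see that although start and end appear to be formatted correctly, the POSIXct (Unix epoch) times are all negative. This means SQLite is evaluating the dates as numeric values rather than as date strings: a bare number passed to a date function is read as a Julian day number (for example, sqldf("SELECT STRFTIME('%s', 14610)") returns -209604456000).
To fix this, we need to make sure the data passed to SQL is in YYYY-MM-DD format (SQLite represents dates as ISO 8601 strings, Julian day numbers, or Unix timestamps; see the SQLite date and time documentation). We can do this by converting the start and end columns of the data frame df to literal date strings.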
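The empty join can also be reproduced in isolation. The following is a minimal sketch, not part of the original answer, and it assumes sqldf is running on its default SQLite backend. It relies on two documented SQLite behaviors: a numeric value always sorts before a text value in a comparison, and a bare number given to strftime() is read as a Julian day number. The column aliases (start_ok, end_ok, as_julian, as_text) are made up for illustration.

```r
library(sqldf)

# 14610 and 15706 are the numbers behind as.Date("2010-01-01") and
# as.Date("2013-01-01"). Compared with a text literal, a number always sorts
# first in SQLite, so "<=" is true and ">=" is false whatever the dates are.
cmp <- sqldf("SELECT 14610 <= '2011-03-01' AS start_ok,
                     15706 >= '2010-03-01' AS end_ok")
print(cmp)

# The same number is read as a Julian day by strftime(), while the ISO 8601
# string is parsed as the intended calendar date.
jd <- sqldf("SELECT strftime('%s', 14610)        AS as_julian,
                    strftime('%s', '2010-01-01') AS as_text")
print(jd)
```

Because the second half of the join condition (t.end >= d.start_date) is false for every row, the join returns zero rows rather than raising an error.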
df$start <- as.character(df$start)
df$end <- as.character(df$end)
Then we retry the code above:
> df <- structure(list(student = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
+ var1 = c("a", "b", "b", "a", "c", "a", "b"),
+ start = structure(c(14610, 14610, 15869, 17439, 14610, 16436, 17897), class = "Date"),
+ end = structure(c(15706, 15706, 16679, 17723, 16071, 17492, 18791), class = "Date")),
+ row.names = c(NA, -7L), class = "data.frame")
> df$end <- as.character(df$end)
> df$start <- as.character(df$start)
> sqldf("SELECT
+ start,
+ STRFTIME('%s', start) AS start_strf,
+ end,
+ STRFTIME('%s', end) AS end_strf
+ FROM df")
start start_strf end end_strf
1 2010-01-01 1262304000 2013-01-01 1356998400
2 2010-01-01 1262304000 2013-01-01 1356998400
3 2013-06-13 1371081600 2015-09-01 1441065600
4 2017-09-30 1506729600 2018-07-11 1531267200
5 2010-01-01 1262304000 2014-01-01 1388534400
6 2015-01-01 1420070400 2017-11-22 1511308800
7 2019-01-01 1546300800 2021-06-13 1623542400
Now that the dates are in the correct format, your code should work:
> sqldf("WITH date_ranges AS (
+ SELECT '2010-03-01' AS start_date, '2011-03-01' AS end_date UNION ALL
+ SELECT '2011-03-01', '2012-03-01' UNION ALL
+ SELECT '2012-03-01', '2013-03-01' UNION ALL
+ SELECT '2013-03-01', '2014-03-01' UNION ALL
+ SELECT '2014-03-01', '2015-03-01' UNION ALL
+ SELECT '2015-03-01', '2016-03-01' UNION ALL
+ SELECT '2016-03-01', '2017-03-01' UNION ALL
+ SELECT '2017-03-01', '2018-03-01' UNION ALL
+ SELECT '2018-03-01', '2019-03-01' UNION ALL
+ SELECT '2019-03-01', '2020-03-01'
+ ),
+ joined_data AS (
+ SELECT
+ t.student,
+ d.start_date,
+ d.end_date,
+ t.var1
+ FROM
+ df t
+ JOIN
+ date_ranges d
+ ON
+ strftime('%s', t.start) <= strftime('%s', d.end_date) AND strftime('%s', t.end) >= strftime('%s', d.start_date)
+ ), var1_counts AS (
+ SELECT
+ student,
+ start_date,
+ end_date,
+ COUNT(CASE WHEN var1 = 'a' THEN 1 END) AS var1_count
+ FROM
+ joined_data
+ GROUP BY
+ student, start_date, end_date
+ )
+ SELECT
+ student,
+ start_date,
+ end_date,
+ CASE WHEN var1_count > 0 THEN 'Yes' ELSE 'No' END AS at_least_one_var1_a
+ FROM
+ var1_counts;")
student start_date end_date at_least_one_var1_a
1 1 2010-03-01 2011-03-01 Yes
2 1 2011-03-01 2012-03-01 Yes
3 1 2012-03-01 2013-03-01 Yes
4 1 2013-03-01 2014-03-01 No
5 1 2014-03-01 2015-03-01 No
6 1 2015-03-01 2016-03-01 No
7 1 2017-03-01 2018-03-01 Yes
8 1 2018-03-01 2019-03-01 Yes
9 2 2010-03-01 2011-03-01 No
10 2 2011-03-01 2012-03-01 No
11 2 2012-03-01 2013-03-01 No
12 2 2013-03-01 2014-03-01 No
13 2 2014-03-01 2015-03-01 Yes
14 2 2015-03-01 2016-03-01 Yes
15 2 2016-03-01 2017-03-01 Yes
16 2 2017-03-01 2018-03-01 Yes
17 2 2018-03-01 2019-03-01 No
18 2 2019-03-01 2020-03-01 No
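As a possible variation (a sketch under stated assumptions, not the answerer's code): once start and end are YYYY-MM-DD strings, they sort in chronological order as plain text, so the strftime() calls in the join are optional. The yearly windows can also be generated in R with seq() instead of being written out with UNION ALL. The names window_starts, window_ends, date_ranges, and result are introduced here for illustration only.

```r
library(sqldf)

# Same data as in the question, with the dates written out as strings.
df <- data.frame(
  student = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  var1    = c("a", "b", "b", "a", "c", "a", "b"),
  start   = c("2010-01-01", "2010-01-01", "2013-06-13", "2017-09-30",
              "2010-01-01", "2015-01-01", "2019-01-01"),
  end     = c("2013-01-01", "2013-01-01", "2015-09-01", "2018-07-11",
              "2014-01-01", "2017-11-22", "2021-06-13"),
  stringsAsFactors = FALSE
)

# Ten one-year windows, 2010-03-01 through 2020-03-01, stored as strings so
# that SQLite compares text with text.
window_starts <- seq(as.Date("2010-03-01"), by = "year", length.out = 10)
window_ends   <- seq(as.Date("2011-03-01"), by = "year", length.out = 10)
date_ranges <- data.frame(
  start_date = as.character(window_starts),
  end_date   = as.character(window_ends),
  stringsAsFactors = FALSE
)

# Overlap join on the raw strings, then flag windows containing var1 = 'a'.
result <- sqldf("
  SELECT t.student, d.start_date, d.end_date,
         CASE WHEN COUNT(CASE WHEN t.var1 = 'a' THEN 1 END) > 0
              THEN 'Yes' ELSE 'No' END AS at_least_one_var1_a
  FROM df t
  JOIN date_ranges d
    ON t.start <= d.end_date AND t.end >= d.start_date
  GROUP BY t.student, d.start_date, d.end_date
  ORDER BY t.student, d.start_date
")
print(result)
```

Windows that overlap none of a student's rows (for example 2016-03-01 to 2017-03-01 for student 1) are absent from the result rather than marked 'No', which matches the output above.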