SQL: Compare two tables for missing records, then on a date field

Question

I have two tables, as follows:

work_assignments

emp_id   | start_date  |   end_date
------------------------------------------
  1      | May-10-2017 | May-30-2017
  1      | Jun-05-2017 | null
  2      | May-08-2017 | null

hourly_pay

emp_id   | start_date  |   end_date    |  rate
-----------------------------------------------
  1      | May-20-2017 | Jun-30-2017   |  75
  1      | Jul-01-2017 | null          |  80

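The two tables share the emp_id (employee ID) foreign key, and by joining them I should be able to:

1. Find employee records that are missing from the hourly_pay table. Given the data here, the query should return emp_id 2 from the work_assignments table.
2. Find records where the hourly_pay start_date is later than the work assignment start_date. Again, given the data here, the query should return emp_id 1 (because work_assignments.start_date has May-10-2017, while the earliest hourly_pay.start_date is May-20-2017).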

I can get the first part of the result with the join query below:

select distinct emp_id from work_assignments
left join hourly_pay hr USING(emp_id)
where hr.emp_id is null

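I am stuck on the second part. I probably need a correlated subquery that finds employees with no hourly_pay record starting on or before the work_assignments start_date? Or is there another way?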

Answer 1

Ham*_*one 2

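This suggests a between condition, with a few twists, but I have had very bad luck using between in joins. They seem to perform some form of cross join on the back end and then filter the rows down to the actual join, where-clause style. I know that is not very technical, but I have never had a non-equality join condition turn out well.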
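
For comparison, the between-style condition described above could be written as an anti-join like the sketch below. It is only an illustration, not a recommendation: the question's sample rows are inlined as CTEs (which shadow the real table names) so that it runs on its own in PostgreSQL, a null end_date is assumed to mean "through today", and it only checks whether the first day of each assignment falls inside some pay range.

```sql
with work_assignments (emp_id, start_date, end_date) as (
  values (1, date '2017-05-10', date '2017-05-30'),
         (1, date '2017-06-05', null),
         (2, date '2017-05-08', null)
),
hourly_pay (emp_id, start_date, end_date, rate) as (
  values (1, date '2017-05-20', date '2017-06-30', 75),
         (1, date '2017-07-01', null, 80)
)
-- Assignments whose first day is not covered by any pay range.
select wa.emp_id, wa.start_date
from work_assignments wa
where not exists (
  select 1
  from hourly_pay hp
  where hp.emp_id = wa.emp_id
    -- non-equality condition: the kind of predicate discussed above
    and wa.start_date between hp.start_date
                          and coalesce(hp.end_date, current_date)
)
order by wa.emp_id, wa.start_date;
```

With the sample rows this should flag the May-10-2017 assignment for emp_id 1 and the May-08-2017 assignment for emp_id 2. Unlike a per-day comparison, it says nothing about how long the uncovered stretch lasts.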

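So, this may seem counterintuitive, but I think exploding out every possible date may actually be your best bet. Without knowing how large your date ranges really are, it is hard to say.

Also, I think this will actually satisfy both conditions in your question at once, by telling you every work assignment that has no corresponding pay rate.

Try this against your actual data and see how it works (and how long it takes).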

with pay_dates as (
  select
    emp_id, rate,
    generate_series (start_date, coalesce (end_date, current_date), interval '1 day') as pd
  from hourly_pay
),
assignment_dates as (
  select
    emp_id, start_date,
    generate_series (start_date, coalesce (end_date, current_date), interval '1 day') as wd
  from work_assignments
)
select
  emp_id, min (wd)::date as from_date,
  max (wd)::date as thru_date
from
  assignment_dates a
where
  not exists (
    select null
    from pay_dates p
    where p.emp_id = a.emp_id
    and a.wd = p.pd
  )
group by
  emp_id, start_date

The result should be every work assignment date range that has no rate:

emp     from             thru
1    '2017-05-10'    '2017-05-19'
2    '2017-05-08'    '2017-11-14'

The cool thing is that it also handles work assignments that are only partially covered by pay records.
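
To try the query without touching real data, the question's sample rows could be loaded into temporary tables along these lines. The column types (integer, date, numeric) are assumptions, since the question does not show the table definitions, and the dates are rewritten in ISO format.

```sql
-- Temporary tables disappear at the end of the session.
create temp table work_assignments (
  emp_id     integer not null,
  start_date date    not null,
  end_date   date              -- null means the assignment is still open
);

create temp table hourly_pay (
  emp_id     integer not null,
  start_date date    not null,
  end_date   date,             -- null means the rate is still in effect
  rate       numeric not null
);

insert into work_assignments (emp_id, start_date, end_date) values
  (1, '2017-05-10', '2017-05-30'),
  (1, '2017-06-05', null),
  (2, '2017-05-08', null);

insert into hourly_pay (emp_id, start_date, end_date, rate) values
  (1, '2017-05-20', '2017-06-30', 75),
  (1, '2017-07-01', null, 80);
```

Because open-ended ranges are closed with current_date, the thru date reported for emp_id 2 depends on the day the query is run.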

-- Edit 3/20/2018 --

Per your request, here is a breakdown of what the logic does.

with pay_dates as(
  select
    emp_id, rate,
    generate_series (start_date, coalesce (end_date, current_date), interval '1 day') as pd
  from hourly_pay
)

This takes the hourly_pay data and explodes it into one record per employee per day:

emp_id    rate    pay date
1         75      5/20/17
1         75      5/21/17
1         75      5/22/17
...
1         75      6/30/17
1         80      7/01/17
1         80      7/02/17
...
1         80      today
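
The expansion step can be seen in isolation with literal dates. This is a small PostgreSQL sketch; the dates are picked only for illustration.

```sql
-- A closed range: one row per day, 2017-05-20 through 2017-05-23.
select generate_series(date '2017-05-20',
                       date '2017-05-23',
                       interval '1 day')::date as pd;

-- An open-ended range: coalesce swaps the null end date for today's date,
-- so the series runs from 2017-07-01 up to the day the query is executed.
select generate_series(date '2017-07-01',
                       coalesce(null::date, current_date),
                       interval '1 day')::date as pd;
```

The cast to date is there because generate_series with an interval step returns timestamps rather than dates.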

Next,

[implied "with"]
assignment_dates as (
  select
    emp_id, start_date,
    generate_series (start_date, coalesce (end_date, current_date), interval '1 day') as wd
  from work_assignments
)

This does essentially the same thing to the work_assignments table, keeping only the start_date column in each row.

Then the main query is this:

select
  emp_id, min (wd)::date as from_date,
  max (wd)::date as thru_date
from
  assignment_dates a
where
  not exists (
    select null
    from pay_dates p
    where p.emp_id = a.emp_id
    and a.wd = p.pd
  )
group by
  emp_id, start_date

This draws from the two queries above. The important part is the anti-join:

not exists (
  select null
  from pay_dates p
  where p.emp_id = a.emp_id
  and a.wd = p.pd
)

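This identifies every work assignment day for which that employee has no corresponding hourly_pay record on that day.

So, in essence, the query takes the date ranges from both tables, expands them into every individual date, and then does an anti-join to see where they do not match.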
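
The same anti-join can also be phrased as left join ... is null, the form the question already uses for its first part. The sketch below uses two tiny hypothetical inline row sets, a and p, standing in for the exploded assignment and pay dates, so it runs on its own.

```sql
with a (emp_id, wd) as (          -- exploded assignment days (illustrative)
  values (1, date '2017-05-19'),
         (1, date '2017-05-20')
),
p (emp_id, pd) as (               -- exploded pay days (illustrative)
  values (1, date '2017-05-20')
)
select a.emp_id, a.wd
from a
left join p
  on p.emp_id = a.emp_id
 and p.pd = a.wd                  -- equality only, no range predicate
where p.emp_id is null;           -- keep days with no matching pay day
```

Only 2017-05-19 survives here, because 2017-05-20 has a matching pay day. Whether not exists or the left join form performs better depends on the planner and the data, so it is worth checking both with explain.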

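While exploding one record into many seems counterintuitive, there are two things to consider:

1. Dates are very bounded creatures. Even 10 years of data comes to fewer than 4,000 records per range, which is not much for a database, even when multiplied by the number of employees. Your time ranges look a lot smaller than that.
2. I have had very, very bad luck with joins on anything other than =, such as between or >. It looks like it does a Cartesian product in the background and then filters the results. By comparison, exploding the ranges at least gives you some control over how much data explosion occurs.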
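
Before running the expansion on real data, the number of rows it would generate can be estimated straight from the ranges. The following is a sketch for PostgreSQL, where subtracting two dates yields an integer number of days. The sample rows are inlined as a CTE so it runs on its own; against real data the CTE would be dropped.

```sql
with work_assignments (emp_id, start_date, end_date) as (
  values (1, date '2017-05-10', date '2017-05-30'),
         (1, date '2017-06-05', null::date),
         (2, date '2017-05-08', null::date)
)
select
  emp_id,
  -- +1 because both the first and the last day are generated
  sum(coalesce(end_date, current_date) - start_date + 1) as exploded_rows
from work_assignments
group by emp_id
order by emp_id;
```

Summing exploded_rows over all employees, and doing the same for hourly_pay, gives a rough idea of how large the two intermediate sets will be.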

For laughs, I ran this against the sample data above and came up with this, which actually looks accurate:

1   '2017-05-10'    '2017-05-19'
2   '2017-05-08'    '2018-03-20'

Let me know if anything is unclear.
