Redshift PostgreSQL DISTINCT ON operator

Ber*_*a2k 5 postgresql distinct distinct-on amazon-redshift postgresql-8.0

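I have a dataset that I want to parse in order to look at multi-touch attribution. The dataset consists of leads who responded to marketing campaigns, along with their marketing sources.

Each lead can respond to multiple campaigns, and I want to get both their first marketing source and their last marketing source in the same table.

I was thinking I could create two tables and then select from both. The first table would hold each person's most recent marketing source (using email as their unique ID).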

create table temp.multitouch1 as (
select distinct on (email) email, date, market_source as last_source 
from sf.campaignmember
where date >= '1/1/2016' ORDER BY email, date DESC);

Then I would create a second table of deduplicated emails, this time with the first source.

create table temp.multitouch2 as (
select distinct on (email) email, date, market_source as first_source 
from sf.campaignmember
where date >= '1/1/2016' ORDER BY email, date ASC);

Finally, I want to simply select the email and join the first and last market sources to it, each in its own column.

select a.email, a.last_source, b.first_source, a.date 
from temp.multitouch1 a
left join temp.multitouch2 b on b.email = a.email

Since DISTINCT ON does not work on Redshift's version of PostgreSQL, I am hoping someone has an idea for solving this another way.

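EDIT 2/22: For more context, I am dealing with people and the campaigns they have responded to. Each record is a "campaign response", and each person can have multiple campaign responses with multiple sources. I am trying to write a select statement that dedupes by person and then has one column for the first campaign/marketing source they responded to and another for the last campaign/marketing source they responded to.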

EDIT 2/24: The ideal output is a table with 4 columns: email, last_source, first_source, date.

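The first and last source columns would be the same for people with only 1 campaign member record, and can differ for anyone with more than 1 campaign member record.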

Use*_*ady 4

I believe you can use row_number() inside case expressions, like this:

SELECT
      email
    , MIN(first_source) AS first_source
    , MIN(date) first_date
    , MAX(last_source) AS last_source
    , MAX(date) AS last_date
FROM (
      SELECT
            email
          , date
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                  ELSE NULL
            END AS first_source
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                  ELSE NULL
            END AS last_source
      FROM sf.campaignmember
      WHERE date >= '2016-01-01'
      ) s
WHERE first_source IS NOT NULL
      OR last_source IS NOT NULL
GROUP BY
      email

Tested here: SQL Fiddle

PostgreSQL 9.3 Schema Setup

CREATE TABLE campaignmember
    (email varchar(3), date timestamp, market_source varchar(1))
;

INSERT INTO campaignmember
    (email, date, market_source)
VALUES
    ('a@a', '2016-01-02 00:00:00', 'x'),
    ('a@a', '2016-01-03 00:00:00', 'y'),
    ('a@a', '2016-01-04 00:00:00', 'z'),
    ('b@b', '2016-01-02 00:00:00', 'x')
;

Query 1

SELECT
      email
    , MIN(first_source) AS first_source
    , MIN(date) first_date
    , MAX(last_source) AS last_source
    , MAX(date) AS last_date
FROM (
      SELECT
            email
          , date
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                  ELSE NULL
            END AS first_source
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                  ELSE NULL
            END AS last_source
      FROM campaignmember
      WHERE date >= '2016-01-01'
      ) s
WHERE first_source IS NOT NULL
      OR last_source IS NOT NULL
GROUP BY
      email

Results

| email | first_source |                first_date | last_source |                 last_date |
|-------|--------------|---------------------------|-------------|---------------------------|
|   a@a |            x | January, 02 2016 00:00:00 |           z | January, 04 2016 00:00:00 |
|   b@b |            x | January, 02 2016 00:00:00 |           x | January, 02 2016 00:00:00 |
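To see why the outer MIN/MAX and the WHERE clause work, it may help to run the inner query on its own. This is only an illustrative sketch against the test table above: each email gets its source in first_source on its earliest row and in last_source on its latest row, and NULL everywhere else. Aggregates skip NULLs, so MIN(first_source) and MAX(last_source) each pick up the single non-NULL value per email.

```sql
-- Inner query only, run against the campaignmember test table above.
-- A CASE with no ELSE yields NULL, the same as the explicit ELSE NULL.
SELECT
      email
    , date
    , CASE
          WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
      END AS first_source
    , CASE
          WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
      END AS last_source
FROM campaignmember
WHERE date >= '2016-01-01'
ORDER BY email, date;

-- Rows for the sample data:
--   a@a | 2016-01-02 | x    | NULL
--   a@a | 2016-01-03 | NULL | NULL   <- dropped by the outer WHERE
--   a@a | 2016-01-04 | NULL | z
--   b@b | 2016-01-02 | x    | x
```

Note that if two rows for the same email share the same date, ROW_NUMBER() picks between them arbitrarily, so an extra tie-breaking column in the ORDER BY would be needed for a repeatable result.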

And a small extension to the request: counting the number of touch points.

SELECT
      email
    , MIN(first_source) AS first_source
    , MIN(date) first_date
    , MAX(last_source) AS last_source
    , MAX(date) AS last_date
    , MAX(numof) AS Numberof_Contacts 
FROM (
      SELECT
            email
          , date
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                  ELSE NULL
            END AS first_source
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                  ELSE NULL
            END AS last_source
          , COUNT(*) OVER (PARTITION BY email) as numof
      FROM campaignmember
      WHERE date >= '2016-01-01'
      ) s
WHERE first_source IS NOT NULL
      OR last_source IS NOT NULL
GROUP BY
      email
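
As a possible alternative (a sketch only, checked against PostgreSQL semantics and not run on Redshift), the same first/last sources can be read with FIRST_VALUE() and then collapsed with SELECT DISTINCT. The explicit ROWS BETWEEN frame is included on the assumption that Redshift requires a frame clause whenever an aggregate-style window function has an ORDER BY; it is harmless in PostgreSQL.

```sql
-- Alternative sketch: FIRST_VALUE over ascending and descending date order.
-- Every row of an email carries the same window results, so DISTINCT
-- collapses them to one row per email.
SELECT DISTINCT
      email
    , FIRST_VALUE(market_source) OVER (
          PARTITION BY email ORDER BY date ASC
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS first_source
    , FIRST_VALUE(market_source) OVER (
          PARTITION BY email ORDER BY date DESC
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS last_source
    , MAX(date) OVER (PARTITION BY email) AS last_date
FROM campaignmember
WHERE date >= '2016-01-01'
ORDER BY email;
```

On the sample data this should give the same first_source and last_source per email as Query 1. Whether it performs better or worse than the row_number() version on a large Redshift table is something to measure rather than assume.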