Dav*_*ave 37 sql t-sql sql-server join greatest-n-per-group
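I have a query that joins a large number of big tables (big in both rows and columns), but one of the tables has some duplicate rows that cause problems for my query. Since this is a read-only real-time feed from another department, I can't fix the data, but I'm trying to prevent the problems in my query.
Given that, I need to add this crap data as a left join to my good query. The data set looks like this: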
IDNo FirstName LastName ...
-------------------------------------------
uqx bob smith
abc john willis
ABC john willis
aBc john willis
WTF jeff bridges
sss bill doe
ere sally abby
wtf jeff bridges
...
(approx. 24 columns, 100K rows)
My first instinct was to perform a DISTINCT, which gave me about 80K rows:
SELECT DISTINCT P.IDNo
FROM people P
But when I try the following, I get all the rows back:
SELECT DISTINCT P.*
FROM people P
or
SELECT
DISTINCT(P.IDNo) AS IDNoUnq
,P.FirstName
,P.LastName
...etc.
FROM people P
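I then thought about applying a FIRST() aggregate function to all the columns, but that feels wrong too. Syntactically, am I doing something wrong here?
Update: Just wanted to note that these records are duplicates based on the non-key / non-indexed ID field listed above. The ID is a text field which, although it holds the same value, differs in case from the other rows, and that is what causes the problem.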
a_h*_*ame 41
distinct is not a function. It always operates on all columns of the select list.
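To make that concrete, here is a minimal T-SQL sketch. The table variable `@people` and its three rows are a hypothetical stand-in for the real feed (the `entry` column is borrowed from the sample data posted in this thread). Both statements should return the same three rows, because the parentheses only wrap the expression `IDNo` and do not narrow what DISTINCT compares.

```sql
-- Hypothetical stand-in for the people table (assumption, not the real feed)
DECLARE @people TABLE (entry int, IDNo varchar(3), FirstName varchar(5));

INSERT INTO @people (entry, IDNo, FirstName)
VALUES (2, 'abc', 'john'),
       (3, 'ABC', 'john'),
       (5, 'WTF', 'jeff');

-- The parentheses are plain expression grouping around IDNo;
-- DISTINCT still compares the whole row (IDNoUnq, FirstName, entry).
SELECT DISTINCT (IDNo) AS IDNoUnq, FirstName, entry
FROM @people;

-- Same query without the parentheses: same result.
SELECT DISTINCT IDNo AS IDNoUnq, FirstName, entry
FROM @people;
```

Because `entry` differs on every row, no two rows are identical and nothing is removed, which matches what the question observed with `SELECT DISTINCT P.*`.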
Your problem is a typical "greatest N per group" problem, which can easily be solved using a window function:
select ...
from (
select IDNo,
FirstName,
LastName,
....,
row_number() over (partition by lower(idno) order by firstname) as rn
from people
) t
where rn = 1;
With the order by clause you can choose which of the duplicates is picked.
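As an illustration of how the ordering picks the survivor, the sketch below keeps the most recent row per ID. It assumes an auto-number column named `entry` (as in the sample data posted in this thread); the table variable `@people` is a hypothetical stand-in for the real table.

```sql
-- Hypothetical stand-in for the people table (assumption)
DECLARE @people TABLE (
    entry     int,
    IDNo      varchar(3),
    FirstName varchar(5),
    LastName  varchar(7)
);

INSERT INTO @people (entry, IDNo, FirstName, LastName)
VALUES (2,  'abc', 'john', 'willis'),
       (3,  'ABC', 'john', 'willis'),
       (4,  'aBc', 'john', 'willis'),
       (5,  'WTF', 'jeff', 'bridges'),
       (10, 'wtf', 'jeff', 'bridges');

-- ORDER BY entry DESC numbers the highest entry in each group as 1,
-- so the filter rn = 1 keeps the last row per case-folded IDNo.
SELECT t.entry, t.IDNo, t.FirstName, t.LastName
FROM (
    SELECT entry, IDNo, FirstName, LastName,
           ROW_NUMBER() OVER (PARTITION BY LOWER(IDNo)
                              ORDER BY entry DESC) AS rn
    FROM @people
) AS t
WHERE t.rn = 1;
```

Switching to `ORDER BY entry ASC` would keep the first row of each group instead.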
The same derived table can be used in a left join:
select ...
from x
left join (
select IDNo,
FirstName,
LastName,
....,
row_number() over (partition by lower(idno) order by firstname) as rn
from people
) p on p.idno = x.idno and p.rn = 1
where ...
Use CROSS APPLY or OUTER APPLY; that way you can limit the data joined from the table with duplicates to the first hit.
Select
x.*,
c.*
from
x
Cross Apply
(
Select
Top (1)
IDNo,
FirstName,
LastName,
....
from
people As p
where
p.idno = x.idno
Order By
p.idno -- unnecessary if you don't need a specific match based on order
) As c
CROSS APPLY behaves like an inner join; OUTER APPLY behaves like a left join.
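A small sketch of that difference, under stated assumptions: `@x` is a hypothetical driving table, `@people` is a hypothetical stand-in for the table with duplicates, and `entry` is an assumed auto-number column used only to make the TOP (1) pick deterministic.

```sql
-- Hypothetical driving table and stand-in for people (assumptions)
DECLARE @x TABLE (idno varchar(3));
INSERT INTO @x (idno) VALUES ('abc'), ('zzz');

DECLARE @people TABLE (
    entry     int,
    IDNo      varchar(3),
    FirstName varchar(5),
    LastName  varchar(7)
);
INSERT INTO @people (entry, IDNo, FirstName, LastName)
VALUES (2, 'abc', 'john', 'willis'),
       (3, 'ABC', 'john', 'willis'),
       (4, 'aBc', 'john', 'willis');

-- OUTER APPLY keeps every row of @x; the row for 'zzz' comes back
-- with NULLs because the inner query finds no match for it.
SELECT x.idno, c.FirstName, c.LastName
FROM @x AS x
OUTER APPLY (
    SELECT TOP (1) p.FirstName, p.LastName
    FROM @people AS p
    WHERE LOWER(p.IDNo) = LOWER(x.idno)   -- case-folded match
    ORDER BY p.entry                      -- first duplicate wins
) AS c;
```

Replacing `OUTER APPLY` with `CROSS APPLY` would drop the `'zzz'` row entirely, the same way an inner join would.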
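After careful consideration, there are a few different solutions to this dilemma:
Aggregate everything: use an aggregate on each column to get the largest or smallest field value. This is what I am doing, since it takes 2 partially filled-out records and "merges" the data.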
http://sqlfiddle.com/#!3/59cde/1
SELECT
UPPER(IDNo) AS user_id
, MAX(FirstName) AS name_first
, MAX(LastName) AS name_last
, MAX(entry) AS row_num
FROM people P
GROUP BY
UPPER(IDNo) -- group on the same case-folded value that is selected
Get the first (or last) record
http://sqlfiddle.com/#!3/59cde/23
-- ------------------------------------------------------
-- Notes
-- entry: auto-number primary key; some sort of unique PK is required for this method
-- IDNo: should be the primary key in the feed, but is not; we make an upper-case version
-- This gets the first entry; to get the last entry, change MIN() to MAX()
-- ------------------------------------------------------
SELECT
PC.user_id
,PData.FirstName
,PData.LastName
,PData.entry
FROM (
SELECT
P2.user_id
,MIN(P2.entry) AS rownum
FROM (
SELECT
UPPER(P.IDNo) AS user_id
, P.entry
FROM people P
) AS P2
GROUP BY
P2.user_id
) AS PC
LEFT JOIN people PData
ON PData.entry = PC.rownum
ORDER BY
PData.entry
小智 6
Add an identity column (PeopleID), then use a correlated subquery to return the first row for each value.
SELECT *
FROM People p
WHERE PeopleID = (
SELECT MIN(PeopleID)
FROM People
WHERE IDNo = p.IDNo
)
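It turns out I was doing it wrong: I needed to first do a nested select of just the important columns and then run DISTINCT on that, so the junk columns of "unique" data could not corrupt my good data. The following seems to have solved the problem... but I will try it on the full data set later.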
SELECT DISTINCT P2.*
FROM (
SELECT
IDNo
, FirstName
, LastName
FROM people P
) P2
Here is some sample data to play with, as requested: http://sqlfiddle.com/#!3/050e0d/3
CREATE TABLE people
(
[entry] int
, [IDNo] varchar(3)
, [FirstName] varchar(5)
, [LastName] varchar(7)
);
INSERT INTO people
(entry,[IDNo], [FirstName], [LastName])
VALUES
(1,'uqx', 'bob', 'smith'),
(2,'abc', 'john', 'willis'),
(3,'ABC', 'john', 'willis'),
(4,'aBc', 'john', 'willis'),
(5,'WTF', 'jeff', 'bridges'),
(6,'Sss', 'bill', 'doe'),
(7,'sSs', 'bill', 'doe'),
(8,'ssS', 'bill', 'doe'),
(9,'ere', 'sally', 'abby'),
(10,'wtf', 'jeff', 'bridges')
;